Data Mining
JNU, Jaipur
First Edition 2013
JNU makes reasonable endeavours to ensure content is current and accurate. JNU reserves the right to alter the
content whenever the need arises, and to vary it at any time without prior notice.
Index
I. Content....................................................................... II
II. List of Figures............................................................. VII
III. List of Tables............................................................. IX
IV. Abbreviations........................................................... X
Book at a Glance
Contents
Chapter I........................................................................................................................................................ 1
Data Warehouse – Need, Planning and Architecture................................................................................ 1
Aim................................................................................................................................................................. 1
Objectives....................................................................................................................................................... 1
Learning outcome........................................................................................................................................... 1
1.1 Introduction . ............................................................................................................................................ 2
1.2 Need for Data Warehousing...................................................................................................................... 4
1.3 Basic Elements of Data Warehousing....................................................................................................... 5
1.4 Project Planning and Management........................................................................................................... 6
1.5 Architecture and Infrastructure................................................................................................................. 8
1.5.1 Infrastructure............................................................................................................................11
1.5.2 Metadata.................................................................................................................................. 13
1.5.3 Metadata Components............................................................................................................ 14
Summary...................................................................................................................................................... 17
References.................................................................................................................................................... 17
Recommended Reading.............................................................................................................................. 17
Self Assessment............................................................................................................................................ 18
Chapter II.................................................................................................................................................... 20
Data Design and Data Representation...................................................................................................... 20
Aim............................................................................................................................................................... 20
Objectives..................................................................................................................................................... 20
Learning outcome......................................................................................................................................... 20
2.1 Introduction............................................................................................................................................. 21
2.2 Design Decision...................................................................................................................................... 21
2.3 Use of CASE Tools................................................................................................................................. 21
2.4 Star Schema............................................................................................................................................ 23
2.4.1 Review of a Simple STAR Schema........................................................................................ 23
2.4.2 Star Schema Keys................................................................................................................... 24
2.5 Dimensional Modelling.......................................................................................................................... 26
2.5.1 E-R Modelling versus Dimensional Modelling...................................................................... 26
2.6 Data Extraction....................................................................................................................................... 26
2.6.1 Source Identification............................................................................................................... 27
2.6.2 Data Extraction Techniques.................................................................................................... 28
2.6.3 Data in Operational Systems................................................................................................... 28
2.7 Data Transformation............................................................................................................................... 33
2.7.1 Major Transformation Types................................................................................................... 34
2.7.2 Data Integration and Consolidation........................................................................................ 36
2.7.3 Implementing Transformation................................................................................................ 37
2.8 Data Loading........................................................................................................................................... 38
2.9 Data Quality............................................................................................................................................ 39
2.10 Information Access and Delivery.......................................................................................................... 40
2.11 Matching Information to Classes of Users OLAP in Data Warehouse................................................. 40
2.11.1 Information from the Data Warehouse.................................................................................. 41
2.11.2 Information Potential............................................................................................................ 41
Summary...................................................................................................................................................... 43
References.................................................................................................................................................... 43
Recommended Reading.............................................................................................................................. 43
Self Assessment............................................................................................................................................ 44
Chapter III................................................................................................................................................... 46
Data Mining................................................................................................................................................. 46
Aim............................................................................................................................................................... 46
Objectives..................................................................................................................................................... 46
Learning outcome......................................................................................................................................... 46
3.1 Introduction............................................................................................................................................. 47
3.2 Crucial Concepts of Data Mining........................................................................................................... 48
3.2.1 Bagging (Voting, Averaging).................................................................................................. 48
3.2.2 Boosting.................................................................................................................................. 49
3.2.3 Data Preparation (in Data Mining)......................................................................................... 49
3.2.4 Data Reduction (for Data Mining).......................................................................................... 49
3.2.5 Deployment............................................................................................................................. 49
3.2.6 Drill-Down Analysis............................................................................................................... 50
3.2.7 Feature Selection..................................................................................................................... 50
3.2.8 Machine Learning................................................................................................................... 50
3.2.9 Meta-Learning........................................................................................................................ 50
3.2.10 Models for Data Mining....................................................................................................... 50
3.2.11 Predictive Data Mining......................................................................................................... 52
3.2.12 Text Mining........................................................................................................................... 52
3.3 Cross-Industry Standard Process: CRISP–DM.......................................................................... 52
3.3.1 CRISP-DM: The Six Phases................................................................................................... 53
3.4 Data Mining Techniques......................................................................................................................... 55
3.5 Graph Mining.......................................................................................................................................... 55
3.6 Social Network Analysis......................................................................................................................... 56
3.6.1 Characteristics of Social Networks......................................................................................... 56
3.6.2 Mining on Social Networks.................................................................................................... 57
3.7 Multirelational Data Mining................................................................................................................... 59
3.8 Data Mining Algorithms and their Types................................................................................................ 60
3.8.1 Classification.......................................................................................................................... 61
3.8.2 Clustering................................................................................................................................ 69
3.8.3 Association Rules.................................................................................................................... 77
Summary...................................................................................................................................................... 80
References.................................................................................................................................................... 80
Recommended Reading.............................................................................................................................. 81
Self Assessment............................................................................................................................................ 82
Chapter IV................................................................................................................................................... 84
Web Application of Data Mining............................................................................................................... 84
Aim............................................................................................................................................................... 84
Objectives..................................................................................................................................................... 84
Learning outcome......................................................................................................................................... 84
4.1 Introduction............................................................................................................................................. 85
4.2 Goals of Data Mining and Knowledge Discovery.................................................................................. 86
4.3 Types of Knowledge Discovered during Data Mining........................................................................... 86
4.4 Knowledge Discovery Process............................................................................................................... 87
4.4.1 Overview of Knowledge Discovery Process.......................................................................... 88
4.5 Web Mining............................................................................................................................................. 90
4.5.1 Web Analysis.......................................................................................................................... 90
4.5.2 Benefits of Web Mining.......................................................................................................... 91
4.6 Web Content Mining............................................................................................................................... 91
4.7 Web Structure Mining.............................................................................................................. 92
4.8 Web Usage Mining.................................................................................................................................. 93
Summary...................................................................................................................................................... 95
References.................................................................................................................................................... 95
Recommended Reading.............................................................................................................................. 95
Self Assessment............................................................................................................................................ 96
Chapter V..................................................................................................................................................... 98
Advanced Topics of Data Mining.................................................................................................. 98
Aim............................................................................................................................................................... 98
Objectives..................................................................................................................................................... 98
Learning outcome......................................................................................................................................... 98
5.1 Introduction............................................................................................................................................. 99
5.2 Concepts.................................................................................................................................................. 99
5.2.1 Mechanism............................................................................................................................ 100
5.2.2 Knowledge to be Discovered................................................................................................ 100
5.3 Techniques of SDMKD......................................................................................................................... 101
5.3.1 SDMKD-based Image Classification.................................................................................... 103
5.3.2 Cloud Model......................................................................................................................... 104
5.3.3 Data Fields............................................................................................................................ 105
5.4 Design- and Model-based Approaches to Spatial Sampling................................................................. 106
5.4.1 Design-based Approach to Sampling.................................................................................... 106
5.4.2 Model-based Approach to Sampling..................................................................................... 107
5.5 Temporal Mining................................................................................................................................... 107
5.5.1 Time in Data Warehouses..................................................................................................... 108
5.5.2 Temporal Constraints and Temporal Relations..................................................................... 108
5.5.3 Requirements for a Temporal Knowledge-Based Management System.............................. 108
5.6 Database Mediators............................................................................................................................... 108
5.6.1 Temporal Relation Discovery............................................................................................... 109
5.6.2 Semantic Queries on Temporal Data.................................................................................... 109
5.7 Temporal Data Types.............................................................................................................................110
5.8 Temporal Data Processing.....................................................................................................................110
5.8.1 Data Normalisation................................................................................................................111
5.9 Temporal Event Representation.............................................................................................................111
5.9.1 Event Representation Using Markov Models........................................................................111
5.9.2 A Formalism for Temporal Objects and Repetitions..............................................................112
5.10 Classification Techniques....................................................................................................................112
5.10.1 Distance-Based Classifier....................................................................................................112
5.10.2 Bayes Classifier...................................................................................................................112
5.10.3 Decision Tree.......................................................................................................................112
5.10.4 Neural Networks in Classification.......................................................................................113
5.11 Sequence Mining..................................................................................................................................113
5.11.1 Apriori Algorithm and Its Extension to Sequence Mining...................................................113
5.11.2 The GSP Algorithm..............................................................................................................114
Summary.....................................................................................................................................................115
References...................................................................................................................................................115
Recommended Reading.............................................................................................................................115
Self Assessment...........................................................................................................................................116
Chapter VI..................................................................................................................................................118
Application and Trends of Data Mining..................................................................................................118
Aim..............................................................................................................................................................118
Objectives....................................................................................................................................................118
Learning outcome........................................................................................................................................118
6.1 Introduction............................................................................................................................................119
6.2 Applications of Data Mining..................................................................................................................119
6.2.1 Aggregation and Approximation in Spatial and Multimedia Data Generalisation................119
6.2.2 Generalisation of Object Identifiers and Class/Subclass Hierarchies....................................119
6.2.3 Generalisation of Class Composition Hierarchies................................................................ 120
6.2.4 Construction and Mining of Object Cubes........................................................................... 120
6.2.5 Generalisation-Based Mining of Plan Databases by Divide-and-Conquer.......................... 120
6.3 Spatial Data Mining.............................................................................................................................. 120
6.3.1 Spatial Data Cube Construction and Spatial OLAP............................................................. 121
6.3.2 Mining Spatial Association and Co-location Patterns.......................................................... 121
6.3.3 Mining Raster Databases...................................................................................................... 121
6.4 Multimedia Data Mining....................................................................................................................... 121
6.4.1 Multidimensional Analysis of Multimedia Data................................................................... 122
6.4.2 Classification and Prediction Analysis of Multimedia Data................................................. 122
6.4.3 Mining Associations in Multimedia Data............................................................................. 122
6.4.4 Audio and Video Data Mining.............................................................................................. 122
6.5 Text Mining........................................................................................................................................... 122
6.6 Query Processing Techniques............................................................................................................... 123
6.6.1 Ways of Dimensionality Reduction for Text.......................................................................... 123
6.6.2 Probabilistic Latent Semantic Indexing Schemas ................................................................ 123
6.6.3 Mining the World Wide Web................................................................................................ 124
6.6.4 Challenges............................................................................................................................. 124
6.7 Data Mining for Healthcare Industry.................................................................................................... 124
6.8 Data Mining for Finance....................................................................................................................... 124
6.9 Data Mining for Retail Industry............................................................................................................ 124
6.10 Data Mining for Telecommunication.................................................................................................. 124
6.11 Data Mining for Higher Education..................................................................................................... 125
6.12 Trends in Data Mining........................................................................................................................ 125
6.12.1 Application Exploration . ................................................................................................... 125
6.12.2 Scalable Data Mining Methods ......................................................................................... 125
6.12.3 Combination of Data Mining with Database Systems, Data Warehouse
Systems, and Web Database Systems................................................................................. 125
6.12.4 Standardisation of Data Mining Language......................................................................... 125
6.12.5 Visual Data Mining ............................................................................................................ 126
6.12.6 New Methods for Mining Complex Types of Data ........................................................... 126
6.12.7 Web Mining ....................................................................................................................... 126
6.13 System Products and Research Prototypes......................................................................................... 126
6.13.1 Choosing a Data Mining System........................................................................................ 126
6.14 Additional Themes on Data Mining.................................................................................................... 128
6.14.1 Theoretical Foundations of Data Mining............................................................................ 128
6.14.2 Statistical Data Mining........................................................................ 129
6.14.3 Visual and Audio Data Mining........................................................................................... 130
6.14.4 Data Mining and Collaborative Filtering............................................................................ 130
Summary.................................................................................................................................................... 132
References.................................................................................................................................................. 132
Recommended Reading............................................................................................................................ 132
Self Assessment.......................................................................................................................................... 133
7.2.4 Establish Clustering Options................................................................................................ 137
7.2.5 Prepare an Indexing Strategy................................................................................................ 138
7.2.6 Assign Storage Structures..................................................................................................... 138
7.2.7 Complete Physical Model..................................................................................................... 138
7.3 Physical Storage.................................................................................................................................... 138
7.3.1 Storage Area Data Structures................................................................................................ 138
7.3.2 Optimising Storage............................................................................................................... 139
7.3.3 Using RAID Technology...................................................................................................... 140
7.4 Indexing the Data Warehouse............................................................................................................... 142
7.4.1 B-Tree Index......................................................................................................................... 142
7.4.2 Bitmapped Index................................................................................................................... 143
7.4.3 Clustered Indexes.................................................................................................................. 143
7.4.4 Indexing the Fact Table......................................................................................................... 144
7.4.5 Indexing the Dimension Tables............................................................................................ 144
7.5 Performance Enhancement Techniques................................................................................................ 144
7.5.1 Data Partitioning................................................................................................................... 144
7.5.2 Data Clustering..................................................................................................................... 145
7.5.3 Parallel Processing................................................................................................................ 145
7.5.4 Summary Levels................................................................................................................... 145
7.5.5 Referential Integrity Checks................................................................................................. 146
7.5.6 Initialisation Parameters....................................................................................................... 146
7.5.7 Data Arrays........................................................................................................................... 146
7.6 Data Warehouse Deployment................................................................................................................ 146
7.6.1 Data Warehouse Deployment Lifecycle................................................................................ 147
7.7 Growth and Maintenance...................................................................................................................... 148
7.7.1 Monitoring the Data Warehouse........................................................................................... 148
7.7.2 Collection of Statistics.......................................................................................................... 149
7.7.3 Using Statistics for Growth Planning................................................................................... 150
7.7.4 Using Statistics for Fine-Tuning........................................................................................... 150
7.7.5 Publishing Trends for Users.................................................................................................. 151
7.8 Managing the Data Warehouse............................................................................................................. 151
7.8.1 Platform Upgrades................................................................................................................ 152
7.8.2 Managing Data Growth........................................................................................................ 152
7.8.3 Storage Management............................................................................................................ 152
7.8.4 ETL Management................................................................................................................. 153
7.8.5 Information Delivery Enhancements.................................................................................... 153
7.8.6 Ongoing Fine-Tuning............................................................................................................ 153
7.9 Models of Data Mining......................................................................................................................... 154
Summary.................................................................................................................................................... 156
References.................................................................................................................................................. 156
Recommended Reading............................................................................................................................ 156
Self Assessment.......................................................................................................................................... 157
List of Figures
Fig. 1.1 Data warehousing.............................................................................................................................. 3
Fig. 1.2 Steps in data warehouse iteration project planning stage.................................................................. 7
Fig. 1.3 Means of identifying required information....................................................................................... 8
Fig. 1.4 Typical data warehousing environment........................................................................................... 10
Fig. 1.5 Overview of data warehouse infrastructure..................................................................................... 12
Fig. 1.6 Data warehouse metadata................................................................................................................ 13
Fig. 1.7 Importance of mapping between two environments........................................................................ 14
Fig. 1.8 Simplest component of metadata..................................................................................................... 14
Fig. 1.9 Storing mapping information in the data warehouse....................................................................... 15
Fig. 1.10 Keeping track of when extracts have been run.............................................................................. 15
Fig. 1.11 Other useful metadata.................................................................................................................... 16
Fig. 2.1 Data design...................................................................................................................................... 21
Fig. 2.2 E-R modelling for OLTP systems.................................................................................................... 22
Fig. 2.3 Dimensional modelling for data warehousing................................................................................. 22
Fig. 2.4 Simple STAR schema for orders analysis....................................................................................... 23
Fig. 2.5 Understanding a query from the STAR schema.............................................................................. 24
Fig. 2.6 STAR schema keys.......................................................................................................................... 25
Fig. 2.7 Source identification process........................................................................................................... 27
Fig. 2.8 Data in operational systems............................................................................................................. 28
Fig. 2.9 Immediate data extraction options................................................................................................... 30
Fig. 2.10 Data extraction using replication technology................................................................................ 31
Fig. 2.11 Deferred data extraction................................................................................................................ 32
Fig. 2.12 Typical data source environment................................................................................................... 36
Fig. 2.13 Enterprise plan-execute-assess closed loop................................................................................... 41
Fig. 3.1 Data mining is the core of knowledge discovery process............................................................... 47
Fig. 3.2 Steps for data mining projects......................................................................................................... 51
Fig. 3.3 Six-sigma methodology................................................................................................................... 51
Fig. 3.4 SEMMA........................................................................................................................................... 51
Fig. 3.5 CRISP–DM is an iterative, adaptive process................................................................................... 53
Fig. 3.6 Data mining techniques................................................................................................................... 55
Fig. 3.7 Methods of mining frequent subgraphs........................................................................................... 56
Fig. 3.8 Heavy-tailed out-degree and in-degree distributions....................................................................... 57
Fig. 3.9 A financial multirelational schema.................................................................................................. 60
Fig. 3.10 Basic sequential covering algorithm.............................................................................................. 65
Fig. 3.11 A general-to-specific search through rule space............................................................................ 66
Fig. 3.12 A multilayer feed-forward neural network.................................................................................... 67
Fig. 3.13 A hierarchical structure for STING clustering............................................................................... 75
Fig. 3.14 EM algorithm................................................................................................................................. 76
Fig. 3.15 Tabular representation of association............................................................................................ 78
Fig. 3.16 Association Rules Networks, 3D................................................................................................... 79
Fig. 4.1 Knowledge base............................................................................................................................... 85
Fig. 4.2 Sequential structure of KDP model................................................................................................. 88
Fig. 4.3 Relative effort spent on specific steps of the KD process............................................................... 89
Fig. 4.4 Web mining architecture.................................................................................................................. 90
Fig. 5.1 Flow diagram of remote sensing image classification with inductive learning............................ 103
Fig. 5.2 Three numerical characteristics..................................................................................................... 104
Fig. 5.3 Using spatial information for estimation from a sample............................................................... 106
Fig. 5.4 Different layers of user query processing...................................................................................... 109
Fig. 5.5 A Markov diagram that describes the probability of program enrolment changes.........................111
Fig. 6.1 Spatial mining................................................................................................................................ 120
Fig. 6.2 Text mining.................................................................................................................................... 123
Fig. 7.1 Physical design process................................................................................................................. 136
Fig. 7.2 Data structures in the warehouse................................................................................................... 139
List of Tables
Table 1.1 Example of source data................................................................................................................... 4
Table 1.2 Example of target data (Data Warehouse)....................................................................................... 4
Table 1.3 Data warehousing elements............................................................................................................ 6
Table 2.1 Basic tasks in data transformation................................................................................................ 34
Table 2.2 Data transformation types............................................................................................................. 36
Table 2.3 Characteristics or indicators of high-quality data......................................................................... 40
Table 2.4 General areas where data warehouse can assist in the planning and assessment phases.............. 42
Table 3.1 The six phases of CRISP-DM....................................................................................................... 55
Table 3.2 Requirements of clustering in data mining................................................................................... 71
Table 5.1 Spatial data mining and knowledge discovery in various viewpoints........................................ 100
Table 5.2 Main spatial knowledge to be discovered................................................................................... 101
Table 5.3 Techniques to be used in SDMKD.............................................................................................. 102
Table 5.4 Terms related to temporal data.................................................................................................... 108
Abbreviations
ANN - Artificial Neural Network
ANOVA - Analysis Of Variance
ARIMA - Autoregressive Integrated Moving Average
ASCII - American Standard Code for Information Interchange
ATM - Automated Teller Machine
BQA - Business Question Assessment
C&RT - Classification and Regression Trees
CA - California
CGI - Common Gateway Interface
CHAID - CHi-squared Automatic Interaction Detector
CPU - Central Processing Unit
CRISP-DM - Cross Industry Standard Process for Data Mining
DB - Database
DBMS - Database Management System
DBSCAN - Density-Based Spatial Clustering of Applications with Noise
DDL - Data Definition Language
DM - Data Mining
DMKD - Data Mining and Knowledge Discovery
DNA - DeoxyriboNucleic Acid
DSS - Decision Support System
DW - Data Warehouse
EBCDIC - Extended Binary Coded Decimal Interchange Code
EIS - Executive Information System
EM - Expectation-Maximisation
En - Entropy
ETL - Extraction, Transformation and Loading
Ex - Expected value
GPS - Global Positioning System
GUI - Graphical User Interface
HIV - Human Immunodeficiency Virus
HTML - Hypertext Markup Language
IR - Information Retrieval
IRC - Internet Relay Chat
JAD - Joint Application Development
KDD - Knowledge Discovery and Data Mining
KDP - Knowledge Discovery Processes
KPI - Key Performance Indicator
MB - Megabytes
MBR - Memory-Based Reasoning
MRDM - Multirelational Data Mining
NY - New York
ODBC - Open Database Connectivity
OLAP - Online Analytical Processing
OLTP - Online Transaction Processing
OPTICS - Ordering Points to Identify the Clustering Structure
PC - Personal Computer
RAID - Redundant Array of Inexpensive Disks
RBF - Radial-Basis Function
RDBMS - Relational Database Management System
SAS - Statistical Analysis System
SDMKD - Spatial Data Mining and Knowledge Discovery
SEMMA - Sample, Explore, Modify, Model, Assess
SOLAM - Spatial Online Analytical Mining
STING - Statistical Information Grid
TM - Temporal Mediator
UAT - User Acceptance Testing
VLSI - Very Large Scale Integration
WWW - World Wide Web
XML - Extensible Markup Language
Chapter I
Data Warehouse – Need, Planning and Architecture
Aim
The aim of this chapter is to:
Objectives
The objectives of this chapter are to:
Learning outcome
At the end of this chapter, you will be able to:
1.1 Introduction
Data warehousing is the process of combining data from various, and usually diverse, sources into one comprehensive and easily used database. Common ways of accessing a data warehouse include queries, analysis and reporting. Because data warehousing ultimately creates a single database, the number of sources can be arbitrary, provided the system can handle the volume. The final result is uniform data that can be manipulated far more easily.
Definition:
Although there are several definitions of a data warehouse, a widely accepted one, given by Inmon (1992), describes it as a subject-oriented, integrated, time-variant and non-volatile repository of information in support of management’s decision-making process. According to Kimball, a data warehouse is “a copy of transaction data specifically structured for query and analysis”; that is, a copy of sets of transactional data, which can come from a range of transactional systems.
• Data warehousing is commonly used by companies to study trends over time. However, its primary function is
facilitating strategic planning resulting from long-term data overviews. From such overviews, forecasts, business
models and similar analytical tools, reports and projections can be made.
• Normally, as the data stored in data warehouses is intended to support overview-style reporting, the data is read-only for end users; it is refreshed periodically through new loads rather than updated by individual queries or transactions.
• Besides being a storehouse for a large amount of data, a data warehouse must have systems in place that make it easy to access the data and use it in day-to-day operations.
• A data warehouse is sometimes said to play a major role in a decision support system (DSS). A DSS is a technique used by organisations to uncover facts, trends or relationships that can assist them in making effective decisions or creating effective strategies to accomplish their organisational goals.
• Data warehouses involve a long-term effort and are usually built in an incremental fashion. In addition to adding
new subject areas, at each cycle, the breadth of data content of existing subject areas is usually increased as
users expand their analysis and their underlying data requirements.
• Users and applications can use the data warehouse directly to perform their analysis. Alternatively, a subset of the data warehouse data, often relating to a specific line of business and/or a specific functional area, can be exported to another, smaller data warehouse, commonly referred to as a data mart.
• Besides integrating and cleansing an organisation’s data for better analysis, one of the benefits of building a
data warehouse is that the effort initially spent to populate it with complete and accurate data content further
benefits any data marts that are sourced from the data warehouse.
[Fig. 1.1 Data warehousing: operational and external data (for example, VAX RMS files and databases) is accessed, transformed and distributed into the data warehouse, where users find and understand metadata and information objects and receive output on PCs or as hardcopy.]
Example:
In order to store data, application designers in every branch have made their own decisions about how an application and its database should be built. As a result, source systems differ in naming conventions, variable measurements, encoding structures and physical attributes of data.
Consider an institution that has several branches in various countries, each with hundreds of students. The following example explains how data is integrated from source systems into target systems.
System Name     | Attribute Name            | Column Name               | Datatype     | Values
Source system 1 | Student Application Date  | STUDENT_APPLICATION_DATE  | NUMERIC(8,0) | 11012005
Source system 2 | Student Application Date  | STUDN_APPLICATION_DATE    | DATE         | 11012005
Source system 3 | Application Date          | APPLICATION_DATE          | DATE         | 01NOV2005

Table 1.1 Example of source data
In the above example, the attribute name, column name, datatype and values are totally different from one source
system to another; this inconsistency in data can be avoided by integrating the data into a data warehouse with good
standards.
System Name | Attribute Name | Column Name | Datatype | Values

Table 1.2 Example of target data (Data Warehouse)
In the above example of target data, attribute names, column names and datatypes are consistent throughout the target
system. This is how data from various source systems is integrated and accurately stored into the data warehouse.
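To make the idea concrete, here is a minimal Python sketch of this kind of standardisation. The record layouts, column-to-parser mapping and the chosen target column name are assumptions made for illustration based on the source-data table above; they are not part of the original example.

```python
from datetime import datetime

# Hypothetical extracts from the three source systems shown above.
# Each system stores the same application date (1 November 2005)
# under a different column name, datatype and format.
source_records = [
    {"system": "Source system 1", "STUDENT_APPLICATION_DATE": 11012005},  # NUMERIC(8,0), MMDDYYYY
    {"system": "Source system 2", "STUDN_APPLICATION_DATE": "11012005"},  # stored as MMDDYYYY text
    {"system": "Source system 3", "APPLICATION_DATE": "01NOV2005"},       # stored as DDMONYYYY text
]

# Mapping of each source column to a parser that yields a Python date.
parsers = {
    "STUDENT_APPLICATION_DATE": lambda v: datetime.strptime(f"{v:08d}", "%m%d%Y").date(),
    "STUDN_APPLICATION_DATE": lambda v: datetime.strptime(v, "%m%d%Y").date(),
    "APPLICATION_DATE": lambda v: datetime.strptime(v, "%d%b%Y").date(),
}

def to_target(record):
    """Return a row in a standard target layout: one attribute name,
    one column name and one datatype (DATE) for every source system."""
    for column, parse in parsers.items():
        if column in record:
            return {
                "system": record["system"],
                "attribute": "Student Application Date",
                "column": "APPLICATION_DATE",          # assumed target column name
                "value": parse(record[column]),
            }
    raise ValueError("no known application-date column in record")

for row in (to_target(r) for r in source_records):
    print(row)
```

Running the sketch prints three rows with identical attribute names, column names and datatypes, mirroring the consistency described for the target table.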
Data Integration
A data warehouse helps in combining scattered and unmanageable data into a single format that can be easily accessed. If the required data is scattered and complicated to gather, business reporting may become inaccurate. By arranging the data properly in a consistent format, it becomes easy to analyse it across all products by location, time and channel.
Traditionally, IT staff provide the reports required from time to time through a series of manual and automated steps: stripping or extracting the data from one source, sorting and merging it with data from other sources, manually scrubbing and enriching the data, and then running reports against it.
Data warehouse serves not only as a repository for historical data but also as an excellent data integration platform.
The data in the data warehouse is integrated, subject oriented, time-variant and non-volatile to enable you to get a
360° view of your organisation.
The flattened (denormalised) data model makes it much easier for users to understand the data and write queries than working with potentially several hundred tables and writing long queries with complex table joins and clauses.
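As a rough illustration of that point, the sketch below uses the pandas library (the tool choice, column names and values are all assumptions for illustration, not part of the text) to answer a typical analytical question directly against a single flattened table; against a normalised schema, the same question would first require joining several tables.

```python
import pandas as pd

# A small, made-up slice of a flattened (denormalised) sales table:
# descriptive attributes are carried on every row, so no joins are needed.
flat_sales = pd.DataFrame({
    "order_date":   ["2005-11-01", "2005-11-01", "2005-11-02", "2005-11-02"],
    "product_name": ["Notebook", "Pen", "Notebook", "Pen"],
    "region":       ["North", "North", "South", "South"],
    "sales_amount": [1200.0, 150.0, 900.0, 200.0],
})

# One expression answers "sales by region and product" on the flat table.
summary = (flat_sales
           .groupby(["region", "product_name"])["sales_amount"]
           .sum()
           .reset_index())
print(summary)

# Against a normalised source, the same question would first require joining
# orders, order lines, products and region tables before aggregating.
```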
Knowledge discovery (KDD) applications use various statistical and data mining techniques and rely on subject-oriented, summarised, cleansed and “de-noised” data, which a well-designed data warehouse can readily provide. The data warehouse also enables an Executive Information System (EIS). Executives typically cannot be expected to go through different reports trying to get a holistic picture of the organisation’s performance before making decisions; they need the KPIs delivered to them. Some of these KPIs may require cross-product or cross-departmental analysis, which may be too manually intensive, if not impossible, to perform on raw data from operational systems. This is especially relevant to relationship marketing and profitability analysis. The data in the data warehouse is already prepared and structured to support this kind of analysis.
Performance
Finally, the performance of transactional systems and query response time make the case for a data warehouse. Transactional systems are meant to do just that – perform transactions efficiently – and hence are designed to optimise frequent database reads and writes. The data warehouse, on the other hand, is designed to optimise complex querying and analysis. Ad-hoc queries and interactive analysis that run in a few seconds to minutes on a data warehouse could take a heavy toll on the transactional systems and literally drag their performance down. Holding historical data in transactional systems for longer periods of time could also interfere with their performance. Hence, historical data needs to find its place in the data warehouse.
Staging Area – A storage area and set of processes that clean, transform, combine, de-duplicate, household, archive and prepare source data for use in the presentation server. In many cases, the primary objects in this area are a set of flat-file tables representing data extracted from the source systems, loading and transformation routines, and a resulting set of tables containing clean data – the Dynamic Data Store. This area does not usually provide query and presentation services.

Presentation Area – The presentation area comprises the target physical machines on which the data warehouse data is organised and stored for direct querying by end users, report writers and other applications. The set of presentable data, or Analytical Data Store, normally takes the form of dimensionally modelled tables when stored in a relational database, and cube files when stored in an OLAP database.

End User Data Access Tools – End user data access tools are any clients of the data warehouse. An end user access tool can be as simple as an ad hoc query tool, or as complex as a sophisticated data mining or modelling application.

Table 1.3 Data warehousing elements
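To give a flavour of the staging-area role described in the table above, here is a small Python sketch. The sample rows, cleaning rules and function names (clean_record, stage, load_presentation) are all invented for illustration and are not part of the original text.

```python
# Illustrative staging-area pipeline: clean, de-duplicate and prepare
# extracted rows before they reach the presentation area.
raw_extract = [
    {"student_id": "S001", "name": " Anita Rao ", "country": "india"},
    {"student_id": "S001", "name": "Anita Rao",   "country": "India"},           # duplicate
    {"student_id": "S002", "name": "John Smith",  "country": "united kingdom"},
]

def clean_record(rec):
    """Trim whitespace and standardise capitalisation (simple scrubbing rules)."""
    return {
        "student_id": rec["student_id"].strip(),
        "name": " ".join(rec["name"].split()),
        "country": rec["country"].strip().title(),
    }

def stage(records):
    """Return cleaned, de-duplicated rows keyed by student_id (the 'clean data' store)."""
    staged = {}
    for rec in map(clean_record, records):
        staged[rec["student_id"]] = rec        # last occurrence wins
    return list(staged.values())

def load_presentation(rows):
    """Stand-in for loading dimensionally modelled tables in the presentation area."""
    for row in rows:
        print("loaded:", row)

load_presentation(stage(raw_extract))
```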
Who will use the Data Warehouse
The power users of a data warehouse are business and financial managers. Data warehouses are meant to deliver clear indications of how the business is performing. Plot out the expected users of the data warehouse in an enterprise and see to it that they will receive the appropriate reports in a format that is quickly understandable. Ensure that planning exercises are conducted in advance to gather scenarios of how the data warehouse will be used. Always keep in mind that the data has to be presented attractively, in a format that business managers will feel comfortable with. Text files with lines of numbers will not suffice!
Technology
The data warehouse will typically be built on a relational Database Management System (DBMS) from one of the major vendors, such as Oracle, IBM or Microsoft. Open source databases such as MySQL can also support data warehousing with the right support in place.
The data warehouse is implemented (populated) one subject area at a time, driven by specific business questions to
be answered by each implementation cycle. The first and subsequent implementation cycles of the data warehouse
are determined during the Business Question Assessment (BQA) stage, which may have been conducted as a separate
project. At this stage in the data warehouse process or at the start of this development/implementation project, the
first (or next if not first) subject area implementation project is planned.
The business requirements discovered in BQA or an equivalent requirements gathering project and, to a lesser extent,
the technical requirements of the Architecture Review and Design stage (or project) are now refined through user
interviews and focus sessions. The requirements should be refined to the subject area level and further analysed to
yield the detail needed to design and implement a single population project, whether initial or follow-on. The data
warehouse project team is expanded to include the members needed to construct and deploy the Warehouse, and
a detailed work plan for the design and implementation of the iteration project is developed and presented to the
customer organisation for approval.
The following diagram illustrates the sequence in which steps in the data warehouse iteration project planning stage
must be conducted.
[Figure: sequence of steps in the data warehouse iteration project planning stage, beginning with the Plan Iteration Development Project step]
[Figure: sources of requirements – existing reports, spreadsheet analysis, and live interviews]
Reports
Existing reports can usually be gathered quickly and inexpensively. In most cases, the information displayed on these reports is easily discerned. However, old reports represent yesterday's requirements, and the underlying calculation of the information may not be obvious at all.
Spreadsheets
Spreadsheets can be gathered easily by asking the DSS analyst community. Like standard reports, the information on spreadsheets can be discerned easily. The problems with spreadsheets are:
• They are very fluid; for example, important spreadsheets may have been created several months ago that are not available now.
• They change with no documentation.
• They may not be easy to gather unless the analyst who created them wants them to be gathered.
• Their structure and usage of data may be obtuse.
Live interviews
Typically, through interviews or JAD sessions, the end user can describe the informational needs of the organisation. Unfortunately, JAD sessions require an enormous amount of energy to conduct and assimilate. Furthermore, the effectiveness of JAD sessions depends in no small part on the imagination and spontaneity of the end users participating in the session.
In any case, gathering the obvious and easily accessed informational needs of the organisation should be done and
should be factored into the data warehouse data model prior to the development of the first iteration of the data
warehouse.
During the Architecture Review and Design stage, the logical data warehouse architecture is developed. The logical architecture is a configuration map of the necessary data stores that make up the Warehouse; it includes a central Enterprise Data Store, an optional Operational Data Store, one or more (optional) individual business area Data Marts, and one or more Metadata stores. The metadata store(s) hold two different kinds of metadata that catalogue reference information about the primary data.
Once the logical configuration is defined, the Data, Application, Technical and Support Architectures are designed to physically implement it. The requirements of these four architectures are carefully analysed, so that the data warehouse can be optimised to serve the users. Gap analysis is conducted to determine which components of each architecture already exist in the organisation and can be reused, and which components must be developed (or purchased) and configured for the data warehouse.
The data architecture organises the sources and stores of business information and defines the quality and management
standards for data and metadata.
The application architecture is the software framework that guides the overall implementation of business functionality
within the Warehouse environment; it controls the movement of data from source to user, including the functions
of data extraction, data cleansing, data transformation, data loading, data refresh, and data access (reporting,
querying).
The technical architecture provides the underlying computing infrastructure that enables the data and application architectures. It includes platform/server, network, communications and connectivity hardware/software/middleware, DBMS, the client/server 2-tier vs. 3-tier approach, and end-user workstation hardware/software. Technical architecture design must address the requirements of scalability, capacity and volume handling (including sizing and partitioning of tables), performance, availability, stability, chargeback, and security.
The support architecture includes the software components (example, tools and structures for backup/recovery,
disaster recovery, performance monitoring, reliability/stability compliance reporting, data archiving, and version
control/configuration management) and organisational functions necessary to effectively manage the technology
investment.
Architecture review and design applies to the long-term strategy for development and refinement of the overall data
warehouse, and is not conducted merely for a single iteration. This stage (or project) develops the blueprint of an
encompassing data and technical structure, software application configuration, and organisational support structure
for the Warehouse. It forms a foundation that drives the iterative Detail Design activities. Where Detail Design tells you what to do, Architecture Review and Design tells you what pieces you need in order to do it.
The architecture review and design stage can be conducted as a separate project that can run mostly in parallel with the business question assessment stage. This is because the technical, data, application and support infrastructure that enables and supports the storage and access of information is generally independent of the business requirements that determine which data is needed to drive the Warehouse. However, the data architecture depends on receiving input from certain BQA or alternative business requirements analysis activities (such as data source system identification and data modelling); therefore, the BQA stage or similar business requirements identification activities must conclude before the Architecture stage or project can conclude.
The architecture will be developed based on the organisation’s long-term data warehouse strategy, so that each future
iteration of the warehouse will be provided for and will fit within the overall data warehouse architecture.
Data warehouses can be architected in many different ways, depending on the specific needs of a business. The model
shown below is the “hub-and-spokes” Data Warehousing architecture that is popular in many organisations.
In short, data is moved from databases used in operational systems into a data warehouse staging area, then into a
data warehouse and finally into a set of conformed data marts. Data is copied from one database to another using a
technology called ETL (Extract, Transform, Load).
[Figure: hub-and-spokes architecture – data from operational applications (for example, customer and product databases) moves through ETL into the staging area, the data warehouse and the conformed data marts]
The principal reason why businesses need to create data warehouses is that their corporate data assets are fragmented across multiple, disparate application systems, running on different technical platforms in different physical locations. This situation does not enable good decision making.
When data redundancy exists in multiple databases, data quality often deteriorates. Poor business intelligence results
in poor strategic and tactical decision making.
Individual business units within an enterprise are designated as “owners” of operational applications and databases.
These “organisational silos” sometimes do not understand the strategic importance of having well integrated, non-
redundant corporate data. Consequently, they frequently purchase or build operational systems that do not integrate
well with existing systems in the business.
Data management issues have deteriorated in recent years as businesses deployed a parallel set of e-business and e-commerce applications that do not integrate with existing "full service" operational applications.
Operational databases are normally “relational” - not “dimensional”. They are designed for operational, data entry
purposes and are not well suited for online queries and analytics.
Due to globalisation, mergers and outsourcing trends, the need to integrate operational data from external organisations
has arisen. The sharing of customer and sales data among business partners can, for example, increase business
intelligence for all business partners.
The challenge for data warehousing is to be able to quickly consolidate, cleanse and integrate data from multiple,
disparate databases that run on different technical platforms in different geographical locations.
Extraction transform loading
ETL technology (shown with arrows in fig. 1.4) is an important component of the data warehousing architecture. It is used to copy data from operational applications to the data warehouse staging area, from the DW staging area into the data warehouse, and finally from the data warehouse into a set of conformed data marts that are accessible by decision makers.
The ETL software extracts data, transforms values of inconsistent data, cleanses “bad” data, filters data and loads
data into a target database. The scheduling of ETL jobs is critical. Should there be a failure in one ETL job, the
remaining ETL jobs must respond appropriately.
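To make this flow concrete, here is a minimal sketch of an ETL routine in Python. It is purely illustrative: the database file names, the orders source table and its columns, and the staging table are assumptions, not part of any particular tool or schema.

import sqlite3

def run_etl(source_db: str, staging_db: str) -> None:
    """Minimal ETL sketch: extract orders, cleanse and transform them, load them into staging."""
    src = sqlite3.connect(source_db)
    stg = sqlite3.connect(staging_db)
    # Extract: pull raw rows from the operational system (assumed 'orders' table).
    rows = src.execute(
        "SELECT order_id, customer_code, order_amount, order_date FROM orders"
    ).fetchall()
    cleaned = []
    for order_id, customer_code, amount, order_date in rows:
        # Cleanse: filter out obviously bad records.
        if amount is None or amount < 0:
            continue
        # Transform: standardise inconsistent values.
        cleaned.append((order_id, customer_code.strip().upper(), round(amount, 2), order_date))
    # Load: write the prepared rows into the staging area.
    stg.execute(
        "CREATE TABLE IF NOT EXISTS stg_orders "
        "(order_id INTEGER, customer_code TEXT, order_amount REAL, order_date TEXT)"
    )
    stg.executemany("INSERT INTO stg_orders VALUES (?, ?, ?, ?)", cleaned)
    stg.commit()
    src.close()
    stg.close()

A production ETL job would wrap a routine like this in scheduling, restart and exception-handling logic, since a failure in one job must not derail the remaining jobs.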
Due to varying business cycles, data processing cycles, hardware and network resource limitations and geographical
factors, it is not feasible to extract all the data from all operational databases at exactly the same time.
For example, it might be reasonable to extract sales data on a daily basis; however, daily extracts might not be
suitable for financial data that requires a month-end reconciliation process. Similarly, it might be feasible to extract
“customer” data from a database in Singapore at noon eastern standard time, but this would not be feasible for
“customer” data in a Chicago database.
Data in the data warehouse can be either persistent (remains around for a long period) or transient (only remains around temporarily).
Not all businesses require a data warehouse staging area. For many businesses, it is feasible to use ETL to copy data
directly from operational databases into the data warehouse.
Data marts
ETL (Extract Transform Load) jobs extract data from the data warehouse and populate one or more data marts
for use by groups of decision makers in the organisations. The data marts can be dimensional (Star Schemas) or
relational, depending on how the information is to be used and what “front end” data warehousing tools will be
used to present the information.
Each data mart can contain different combinations of tables, columns and rows from the enterprise data
warehouse.
For example, a business unit or user group that does not need much historical data might only need transactions from the current calendar year in the database. The personnel department might need to see all details about employees, whereas data such as "salary" or "home address" might not be appropriate for a data mart that focuses on Sales. Some data marts might need to be refreshed from the data warehouse daily, whereas other user groups might need refreshes only monthly.
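As a small illustration of a data mart holding only a subset of the warehouse, the hypothetical sketch below rebuilds a Sales mart from current-year rows and a handful of non-sensitive columns; the warehouse table dw_order_facts and its columns are assumed names, not a prescribed schema.

import sqlite3

def refresh_sales_mart(conn: sqlite3.Connection) -> None:
    """Rebuild a Sales data mart as a row and column subset of the warehouse."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS mart_sales;
        -- Only current-year transactions, and none of the sensitive HR columns.
        CREATE TABLE mart_sales AS
        SELECT order_id, product_key, customer_key, order_dollars, quantity_sold
        FROM   dw_order_facts
        WHERE  order_date >= date('now', 'start of year');
        """
    )
    conn.commit()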
1.5.1 Infrastructure
A data warehouse is a ‘business infrastructure’. In a practical world, it does not do anything on its own, but provides
sanitised, consistent and integrated information for host of applications and end-user tools. Therefore, the stability,
availability and response time of this platform is critical. Just like a foundation pillar, its strength is core to your
information management success.
[Figure: data flow – data from OLTP systems passes through ETL and an ODS into the data warehouse server, and on to data marts that feed OLAP, data mining and data visualisation tools]
1.5.2 Metadata
Metadata is data about data. Metadata has been around as long as there have been programs and data that the programs
operate on. Following figure shows metadata in a simple form.
[Figure: metadata in a simple form – a browser in the infrastructure tier accessing a metadata repository]
While metadata is not new, the role of metadata and its importance in the face of the data warehouse certainly is
new. For years, the information technology professional has worked in the same environment as metadata, but in
many ways has paid little attention to metadata. The information professional has spent a life dedicated to process
and functional analysis, user requirements, maintenance, architectures, and the like. The role of metadata has been
passive at best in this situation.
However, metadata plays a very different role in data warehouse. Relegating metadata to a backwater, passive role
in the data warehouse environment is to defeat the purpose of data warehouse. Metadata plays a very active and
important part in the data warehouse environment. The reason why metadata plays such an important and active role
in the data warehouse environment is apparent when contrasting the operational environment to the data warehouse
environment in so far as the user community is concerned.
Mapping
A basic part of the data warehouse environment is that of mapping from the operational environment into the data
warehouse. The mapping includes a wide variety of facets, including, but not limited to:
• mapping from one attribute to another
• conversions
• changes in naming conventions
• changes in physical characteristics of data
• filtering of data
Following figure shows the storing of the mapping in metadata for the data warehouse.
Fig. 1.7 Importance of mapping between two environments
(Source: http://www.inmoncif.com/registration/whitepapers/ttmeta-1.pdf)
It may not be obvious why mapping information is so important in the data warehouse environment. Consider the
vice president of marketing who has just asked for a new report. The DSS analyst turns to the data warehouse for the
data for the report. Upon inspection, the vice president proclaims the report to be fiction. The credibility of the DSS
analyst goes down until the DSS analyst can prove the data in the report to be valid. The DSS analyst first looks to
the validity of the data in the warehouse. If the data warehouse data has not been reported properly, then the reports
are adjusted. However, if the reports have been made properly from the data warehouse, the DSS analyst is in the
position of having to go back to the operational source to salvage credibility. At this point, if the mapping data has
been carefully stored, then the DSS analyst can quickly and gracefully go to the operational source. However, if the
mapping has not been stored or has not been stored properly, then the DSS analyst has a difficult time defending
his/her conclusions to management. The metadata store for the data warehouse is a natural place for the storing of
mapping information.
Fig. 1.8 Simplest component of metadata
(Source: http://www.inmoncif.com/registration/whitepapers/ttmeta-1.pdf)
Mapping
The typical contents of mapping metadata that are stored in the data warehouse metadata store are:
• identification of source field(s)
• simple attribute to attribute mapping
• attributes conversions
• physical characteristic conversions
• encoding/reference table conversions
• naming changes
• key changes
• defaults
• logic to choose from multiple sources
• algorithmic changes, and so forth
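One way to picture a mapping entry in the metadata store is as a simple structured record. The sketch below is an assumed, illustrative Python representation of the items listed above for a single target attribute; it is not a prescribed metadata format, and the source field name is invented for the example.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MappingMetadata:
    """Illustrative mapping entry in the warehouse metadata store for one target attribute."""
    target_attribute: str                 # attribute in the data warehouse
    source_fields: List[str]              # identification of source field(s)
    conversion: str = "none"              # attribute, physical or encoding conversion applied
    naming_change: Optional[str] = None   # previous name, if the attribute was renamed
    key_change: Optional[str] = None      # how keys were changed, if at all
    default_value: object = None          # default applied when the source value is missing
    selection_logic: str = ""             # logic to choose from multiple sources

# Example entry: customer gender decoded from a legacy code.
gender_mapping = MappingMetadata(
    target_attribute="customer_gender",
    source_fields=["CRM.CUST_MASTER.GNDR_CD"],
    conversion="decode 1/2 to Male/Female",
    default_value="Unknown",
)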
[Figure: mapping metadata records how data from the operational applications (appl A, appl B, appl C and appl D) maps into the data warehouse]
Extract History
The actual history of extracts and transformations of data coming from the operational environment and heading for
the data warehouse environment is another component that belongs in the data warehouse metadata store.
[Figure: the extract history – a record of extracts from the operational environment into the data warehouse – is kept in the metadata store]
The extract history simply tells the DSS analyst when data entered the data warehouse. The DSS analyst has many uses for this type of information. One occasion is when the DSS analyst wants to know when the data in the warehouse was last refreshed. Another occasion is when the DSS analyst wants to do what-if processing where the assertions of the analysis have changed. The DSS analyst needs to know whether the results obtained for one analysis differ from the results obtained by an earlier analysis because of a change in the assertions or a change in the data. There are many cases where the DSS analyst needs the precise history of when insertions have been made to the data warehouse.
Miscellaneous
Alias information is attribute and key information that allows for alternative names. Alternative names often make a data warehouse environment much more "user friendly". In some cases, technicians have influenced naming conventions that cause data warehouse names to be incomprehensible.
[Figure: miscellaneous metadata components kept for the data warehouse – alias information, status, volumetrics, and aging/purge criteria]
In other cases, one department's names for data have been entered into the warehouse, and another department would like to have its names for the data imposed. Aliases are a good way to resolve these issues. Another useful data warehouse metadata component is status. In some cases, a data warehouse table is undergoing design. In other cases, the table is inactive or may contain misleading data. The existence of a status field is a good way to resolve these differences. Volumetrics are measurements about data in the warehouse. Typical volumetric information might include:
• the number of rows currently in the table
• the growth rate of the table
• the statistical profile of the table
• the usage characteristics of the table
• the indexing for the table and its structure and
• the byte specifications for the table.
Volumetric information is useful for the DSS analyst planning efficient usage of the data warehouse. It is much more effective to consult the volumetrics before submitting a query that will use unknown resources than it is to simply submit the query and hope for the best.
Aging/purge criteria is also an important component of data warehouse metadata. Looking into the metadata store
for a definition of the life cycle of data warehouse data is much more efficient than trying to divine the life cycle
by examining the data inside the warehouse.
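Volumetric and aging/purge information can likewise be held as structured metadata. The following sketch is an assumed, illustrative record for one warehouse table, showing how a DSS analyst might consult it before submitting a heavy query; the table name and the figures are invented for the example.

from dataclasses import dataclass
from typing import List

@dataclass
class TableMetadata:
    """Illustrative volumetric and life-cycle metadata for one warehouse table."""
    table_name: str
    row_count: int               # number of rows currently in the table
    growth_rate_per_month: int   # expected new rows per month
    avg_row_bytes: int           # byte specification per row
    indexes: List[str]           # indexing for the table
    purge_after_years: int       # aging/purge criterion

sales_fact = TableMetadata(
    table_name="dw_order_facts",
    row_count=48_000_000,
    growth_rate_per_month=1_200_000,
    avg_row_bytes=96,
    indexes=["ix_order_date", "ix_product_key"],
    purge_after_years=7,
)

# A DSS analyst can consult these figures before submitting a resource-hungry query.
estimated_gb = sales_fact.row_count * sales_fact.avg_row_bytes / 1e9
print(f"{sales_fact.table_name}: roughly {estimated_gb:.1f} GB")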
Summary
• Data warehousing is combining data from various and usually diverse sources into one comprehensive and
easily operated database.
• Common accessing systems of data warehousing include queries, analysis and reporting.
• Data warehousing is commonly used by companies to analyse trends over time.
• Data Warehousing is an important part of, and in most cases the foundation of, business intelligence architecture.
• Data warehouse helps in combining scattered and unmanageable data into a particular format, which can be
easily accessible.
• The data warehouse is designed specifically to support querying, reporting and analysis tasks.
• Knowledge discovery and data mining (KDD) is the automatic extraction of non-obvious hidden knowledge
from large volumes of data.
• For successful data warehousing, proper planning and management is necessary. For this it is necessary to
fulfil all necessary requirements. Bad planning and improper project management practice is the main factor
for failures in data warehouse project planning.
• Data warehousing comes in all shapes and sizes, which bear a direct relationship to cost and time involved.
• The architecture is the logical and physical foundation on which the data warehouse will be built.
• The data warehouse staging area is a temporary location where data from source systems is copied.
• Metadata is data about data. Metadata has been around as long as there have been programs and data that the
programs operate on.
• A basic part of the data warehouse environment is that of mapping from the operational environment into the
data warehouse.
References
• Mailvaganam, H., 2007. Data Warehouse Project Management [Online] Available at: <http://www.dwreview.
com/Articles/Project_Management.html>. [Accessed 8 September 2011].
• Hadley, L., 2002. Developing a Data Warehouse Architecture [Online] Available at: <http://www.users.qwest.
net/~lauramh/resume/thorn.htm>. [Accessed 8 September 2011].
• Humphries, M., Hawkins, M. W. and Dy, M. C., 1999. Data Warehousing: Architecture and Implementation, Prentice Hall Professional.
• Ponniah, P., 2001. Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, Wiley-Interscience Publication.
• Kumar, A., 2008. Data Warehouse Layered Architecture 1 [Video Online] Available at: < http://www.youtube.
com/watch?v=epNENgd40T4>. [Accessed 11 September 2011].
• Intricity101, 2011. What is OLAP? [Video Online] Available at: <http://www.youtube.com/watch?v=2ryG3Jy6eIY&feature=related>. [Accessed 12 September 2011].
Recommended Reading
• Parida, R., 2006. Principles & Implementation of Data Warehousing, Firewell Media.
• Khan, A., 2003. Data Warehousing 101: Concepts and Implementation, iUniverse.
• Jarke, M., 2003. Fundamentals of data warehouses, 2nd ed., Springer.
Self Assessment
1. _____________ is combining data from diverse sources into one comprehensive and easily operated database.
a. Data warehousing
b. Data mining
c. Mapping
d. Metadata
3. ________________ is a technique used by organisations to come up with facts, trends or relationships that can help them make effective decisions.
a. Mapping
b. Operation analysis
c. Decision support system
d. Data integration
5. Bad planning and improper ___________ practice is the main factor for failures in data warehouse project
planning.
a. project management
b. operation management
c. business management
d. marketing management
6. __________ comes in all shapes and sizes, which bears a direct relationship to cost and time involved.
a. Metadata
b. Data mining
c. Mapping
d. Data warehousing
7. The ________ simply tells the DSS analyst when data entered the data warehouse.
a. mapping
b. components
c. extract history
d. miscellaneous
8. Which of the following information is useful for the DSS analyst planning an efficient usage of the data
warehouse?
a. Matrices
b. Volumetric
c. Algebraic
d. Statistical
Chapter II
Data Design and Data Representation
Aim
The aim of this chapter is to:
Objectives
The objectives of this chapter are to:
Learning outcome
At the end of this chapter, you will be able to:
2.1 Introduction
Data design consists of putting together the data structures. A group of data elements form a data structure. Logical
data design includes determination of the various data elements that are needed and combination of the data elements
into structures of data. Logical data design also includes establishing the relationships among the data structures.
Observe in the following figure how the phases start with requirements gathering. The results of the requirements gathering phase are documented in detail in the requirements definition document. An essential component of this
document is the set of information package diagrams. Remember that these are information matrices showing the
metrics, business dimensions, and the hierarchies within individual business dimensions. The information package
diagrams form the basis for the logical data design for the data warehouse. The data design process results in a
dimensional data model.
[Figure: requirements gathering produces the requirements definition document and information packages, which feed the data design phase and its resulting dimensional model]
You can use a case tool to define the tables, the attributes, and the relationships. You can assign the primary keys
and indicate the foreign keys. You can form the entity-relationship diagrams. All of this is done very easily using
graphical user interfaces and powerful drag-and-drop facilities. After creating an initial model, you may add fields,
delete fields, change field characteristics, create new relationships, and make any number of revisions with utmost
ease.
Another very useful function found in the case tools is the ability to forward-engineer the model and generate the
schema for the target database system you need to work with. Forward-engineering is easily done with these case
tools.
Entity-Relationship Modelling:
• Removes data redundancy
• Ensures data consistency
• Expresses microscopic relationships
Dimensional Modelling:
• Captures critical measures
• Views along dimensions
• Intuitive to business users
For modelling the data warehouse, one needs to know the dimensional modelling technique. Most of the existing
vendors have expanded their modelling case tools to include dimensional modelling. You can create fact tables,
dimension tables, and establish the relationships between each dimension table and the fact table. The result is a
STAR schema for your model. Again, you can forward-engineer the dimensional STAR model into a relational
schema for your chosen database management system.
2.4 Star Schema
Creating the STAR schema is the fundamental data design technique for the data warehouse. It is necessary to gain
a good grasp of this technique.
[Figure: STAR schema for orders – a central fact table of Order Measures (order dollars, cost, margin dollars, quantity sold) surrounded by the Customer (customer name, customer code, billing address, shipping address), Product (product name, SKU, brand), Order Date (date, month, quarter, year) and Salesperson (salesperson name, territory name, region name) dimension tables]
The users in this department will analyse the orders using dollar amounts, cost, profit margin, and sold quantity. This
information is found in the fact table of the structure. The users will analyse these measurements by breaking down
the numbers in combinations by customer, salesperson, date, and product. All these dimensions along which the
users will analyse are found in the structure. The STAR schema structure is a structure that can be easily understood
by the users and with which they can comfortably work. The structure mirrors how the users normally view their
critical measures along with their business dimensions.
When you look at the order dollars, the STAR schema structure intuitively answers the questions of what, when, by
whom, and to whom. From the STAR schema, the users can easily visualise the answers to these questions: For a
given amount of dollars, what was the product sold? Who was the customer? Which salesperson brought the order?
When was the order placed?
When a query is made against the data warehouse, the results of the query are produced by combining or joining one or more dimension tables with the fact table. The joins are between the fact table and individual dimension tables.
The relationship of a particular row in the fact table is with the rows in each dimension table. These individual
relationships are clearly shown as the spikes of the STAR schema.
Take a simple query against the STAR schema. Let us say that the marketing department wants the quantity sold and
order dollars for product bigpart-1, relating to customers in the state of Maine, obtained by salesperson Jane Doe,
during the month of June. Following figure shows how this query is formulated from the STAR schema. Constraints
and filters for queries are easily understood by looking at the STAR schema.
[Figure: the query formulated against the STAR schema – constraints Product name = bigpart-1, State = Maine, Salesperson name = Jane Doe and Month = June are applied to the dimension tables, and quantity sold and order dollars are read from the fact table]
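As a sketch of how such a query is expressed against a relational STAR schema, the hypothetical SQL below joins the fact table to each constrained dimension table; the table and column names are assumptions chosen to mirror the example, not a prescribed schema.

import sqlite3

QUERY = """
SELECT SUM(f.quantity_sold) AS quantity_sold,
       SUM(f.order_dollars) AS order_dollars
FROM   order_facts     f
JOIN   product_dim     p ON p.product_key     = f.product_key
JOIN   customer_dim    c ON c.customer_key    = f.customer_key
JOIN   salesperson_dim s ON s.salesperson_key = f.salesperson_key
JOIN   date_dim        d ON d.date_key        = f.date_key
WHERE  p.product_name     = 'bigpart-1'
  AND  c.state            = 'Maine'
  AND  s.salesperson_name = 'Jane Doe'
  AND  d.month            = 'June';
"""

def run_star_query(db_path: str):
    """Return the two measures for the constrained STAR-schema query."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(QUERY).fetchone()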
[Figure: a sales fact table carrying the STORE KEY, PRODUCT KEY and TIME KEY foreign keys and the measures Dollars and Units, joined to the store dimension (store description, district, region, level) and to the product and time dimension tables]
Primary Keys
Each row in a dimension table is identified by a unique value of an attribute designated as the primary key of the
dimension. In a product dimension table, the primary key identifies each product uniquely. In the customer dimension
table, the customer number identifies each customer uniquely. Similarly, in the sales representative dimension table,
the social security number of the sales representative identifies each sales representative.
Surrogate Keys
There are two general principles to be applied when choosing primary keys for dimension tables. The first principle
is derived from the problem caused when the product began to be stored in a different warehouse. In other words,
the product key in the operational system has built-in meanings. Some positions in the operational system product
key indicate the warehouse and some other positions in the key indicate the product category. These are built-in
meanings in the key. The first principle to follow is: avoid built-in meanings in the primary key of the dimension
tables.
The data of the retired customer may still be used for aggregations and comparisons by city and state. Therefore,
the second principle is: do not use production system keys as primary keys for dimension tables. The surrogate keys
are simply system-generated sequence numbers. They do not have any built-in meanings. Of course, the surrogate
keys will be mapped to the production system keys. Nevertheless, they are different. The general practice is to keep
the operational system keys as additional attributes in the dimension tables. Please refer back to Figure 2.5. The
STORE KEY is the surrogate primary key for the store dimension table. The operational system primary key for
the store reference table may be kept as just another non-key attribute in the store dimension table.
Foreign Keys
Each dimension table is in a one-to-many relationship with the central fact table. So the primary key of each
dimension table must be a foreign key in the fact table. If there are four dimension tables of product, date, customer,
and sales representative, then the primary key of each of these four tables must be present in the orders fact table
as foreign keys.
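To tie the primary key, surrogate key and foreign key rules together, here is a minimal, assumed-for-illustration schema sketch. Each dimension table uses a system-generated surrogate key with no built-in meaning, keeps the operational key only as an ordinary attribute, and the fact table carries each surrogate key as a foreign key; the table and column names are hypothetical.

import sqlite3

DDL = """
CREATE TABLE store_dim (
    store_key      INTEGER PRIMARY KEY,   -- surrogate key, no built-in meaning
    store_id_ops   TEXT,                  -- operational-system key kept as an ordinary attribute
    store_desc     TEXT,
    district_desc  TEXT,
    region_desc    TEXT
);
CREATE TABLE product_dim (
    product_key    INTEGER PRIMARY KEY,   -- surrogate key
    product_id_ops TEXT,
    product_name   TEXT,
    brand          TEXT
);
CREATE TABLE time_dim (
    time_key       INTEGER PRIMARY KEY,   -- surrogate key
    calendar_date  TEXT,
    month          TEXT,
    year           INTEGER
);
CREATE TABLE sales_fact (
    store_key   INTEGER REFERENCES store_dim(store_key),     -- foreign keys to each
    product_key INTEGER REFERENCES product_dim(product_key), -- dimension table
    time_key    INTEGER REFERENCES time_dim(time_key),
    dollars     REAL,
    units       INTEGER
);
"""

with sqlite3.connect(":memory:") as conn:
    conn.executescript(DDL)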
In the multidimensional information package diagram, we discussed the foundation for the dimensional model. The dimensional model therefore consists of the specific data structures needed to represent the business dimensions. These data structures also contain the metrics or facts.
Dimensional modelling is a technique for conceptualising and visualising data models as a set of measures that are described by common aspects of the business. Dimensional modelling has two basic concepts: facts and dimensions. A fact is a measure to be analysed, such as the number of customers; a dimension is the parameter, such as time, over which the analysis of facts is performed and which gives meaning to the measure.
These two factors increase the complexity of data extraction for a data warehouse and, therefore, warrant the use of third-party data extraction tools in addition to in-house programs or scripts. Third-party tools are generally more expensive than in-house programs, but they record their own metadata. On the other hand, in-house programs increase the cost of maintenance and are hard to maintain as source systems change.
If the company is in an industry where frequent changes to business conditions are the norm, then you may want to minimise the use of in-house programs. Third-party tools usually provide built-in flexibility; to use it, you simply change the input parameters for the third-party tool in use. Effective data extraction is a key to the success of the data warehouse. Therefore, pay special attention to this issue and formulate a data extraction strategy for your data warehouse. Here is a list of data extraction issues:
• Source identification: identify source applications and source structures.
• Method of extraction: for each data source, define whether the extraction process is manual or tool-based.
• Extraction frequency: for each data source, establish how frequently the data extraction must be done – daily, weekly, quarterly, and so on.
• Time window: for each data source, denote the time window for the extraction process.
• Job sequencing: determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully.
• Exception handling: determine how to handle input records that cannot be extracted.
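One simple way to make such a strategy operational is to record it as structured configuration. The sketch below is an assumed, illustrative Python structure capturing these issues for two hypothetical sources; the source names, schedules and policies are invented for the example.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractionSpec:
    """Illustrative extraction-strategy entry for one data source."""
    source_system: str
    source_structure: str        # source identification
    method: str                  # "manual" or "tool-based"
    frequency: str               # daily, weekly, quarterly, ...
    time_window: str             # when the extraction may run
    depends_on: Optional[str]    # job sequencing: wait for this job to finish first
    on_bad_record: str           # exception-handling policy

EXTRACTION_PLAN = [
    ExtractionSpec("OrderEntry", "ORDERS table", "tool-based", "daily",
                   "01:00-03:00", depends_on=None, on_bad_record="reject-and-log"),
    ExtractionSpec("Finance", "GL extract file", "manual", "monthly",
                   "after month-end close", depends_on="OrderEntry",
                   on_bad_record="suspend-for-review"),
]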
Assume that a part of the database, maybe one of the data marts, is designed to provide strategic information on the
fulfilment of orders. For this purpose, it is necessary to store historical information about the fulfilled and pending
orders. If the orders are shipped through multiple delivery channels, one needs to capture data about these channels.
If the users are interested in analysing the orders by the status of the orders as the orders go through the fulfilment
process, then one needs to extract data on the order statuses. In the fact table for order fulfilment, one needs attributes
about the total order amount, discounts, commissions, expected delivery time, actual delivery time, and dates at
different stages of the process. One needs dimension tables for product, order disposition, delivery channel, and
customer. First, it is necessary to determine if one has source systems to provide you with the data needed for this
data mart. Then, from the source systems, one needs to establish the correct data source for each data element in the
data mart. Further, go through a verification process to ensure that the identified sources are really the right ones.
The following figure describes a stepwise approach to source identification for order fulfilment. Source identification is not as simple a process as it may sound. It is a critical first process in the data extraction function. You need to go through the source identification process for every piece of information you have to store in the data warehouse.
Data in the source systems are said to be time-dependent or temporal. This is because source data changes with
time. The value of a single variable varies over time.
Next, take the example of the change of address of a customer for a move from New York State to California. In
the operational system, what is important is that the current address of the customer has CA as the state code. The
actual change transaction itself, stating that the previous state code was NY and the revised state code is CA, need
not be preserved. But think about how this change affects the information in the data warehouse. If the state code is
used for analysing some measurements such as sales, the sales to the customer prior to the change must be counted
in New York State and those after the move must be counted in California. In other words, the history cannot be
ignored in the data warehouse. This raises the question: how do we capture the history from the source systems? The
answer depends on how exactly data is stored in the source systems. So let us examine and understand how data is
stored in the source operational systems.
[Figure: how source values change over time – a customer's state code starts as OH and changes to CA (9/15/2000), NY (1/22/2001) and NJ (3/1/2001); a property's status is recorded periodically as RE, Property receipted (6/1/2000), ES, Value estimated (9/15/2000), AS, Assigned to auction (1/22/2001) and SL, Property sold (3/1/2001), with each status preserved alongside its effective date]
Current value
Most of the attributes in the source systems fall into this category. Here, the stored value of an attribute represents
the value of the attribute at this moment of time. The values are transient or transitory. As business transactions
happen, the values change. There is no way to predict how long the present value will stay or when it will get
changed next. Customer name and address, bank account balances, and outstanding amounts on individual orders
are some examples of this category. What is the implication of this category for data extraction? The value of an
attribute remains constant only until a business transaction changes it. There is no telling when it will get changed.
Data extraction for preserving the history of the changes in the data warehouse gets quite involved for this category
of data.
Periodic status
This category is not as common as the previous category. In this category, the value of the attribute is preserved as
the status every time a change occurs. At each of these points in time, the status value is stored with reference to
the time when the new value became effective. This category also includes events stored with reference to the time
when each event occurred. Look at the way data about an insurance policy is usually recorded in the operational
systems of an insurance company. The operational databases store the status data of the policy at each point of
time when something in the policy changes. Similarly, for an insurance claim, each event, such as claim initiation,
verification, appraisal, and settlement, is recorded with reference to the points in time. For operational data in this
category, the history of the changes is preserved in the source systems themselves. Therefore, data extraction for
the purpose of keeping history in the data warehouse is relatively easier. Whether it is status data or data about an
event, the source systems contain data at each point in time when any change occurred. Pay special attention to the
examples. Having reviewed the categories indicating how data is stored in the operational systems, we are now in
a position to discuss the common techniques for data extraction. When you deploy your data warehouse, the initial
data as of a certain time must be moved to the data warehouse to get it started. This is the initial load. After the
initial load, your data warehouse must be kept updated so the history of the changes and statuses are reflected in the
data warehouse. Broadly, there are two major types of data extractions from the source operational systems: “as is”
(static) data and data of revisions.
“As is” or static data is the capture of data at a given point in time. It is like taking a snapshot of the relevant
source data at a certain point in time. For current or transient data, this capture would include all transient
data identified for extraction. In addition, for data categorised as periodic, this data capture would include
each status or event at each point in time as available in the source operational systems. Primarily, you will
use static data capture for the initial load of the data warehouse. Sometimes, you may want a full refresh of
a dimension table. For example, assume that the product master of your source application is completely
revamped. In this case, you may find it easier to do a full refresh of the product dimension table of the target
data warehouse. Therefore, for this purpose, you will perform a static data capture of the product data.
Data of revisions is also known as incremental data capture. Strictly, it is not incremental data but the
revisions since the last time data was captured. If the source data is transient, the capture of the revisions
is not easy. For periodic status data or periodic event data, the incremental data capture includes the values
of attributes at specific times. Extract the statuses and events that have been recorded since the last date of
extract. Incremental data capture may be immediate or deferred. Within the group of immediate data capture
there are three distinct options. Two separate options are available for deferred data capture.
[Figure: immediate data extraction – Option 1: capture through transaction logs; Option 2: capture through database triggers (output files of trigger programs); Option 3: capture in the source application; in each case the extracted data moves from the source operational systems into the data staging area]
If all source systems are database applications, there is no problem with this technique. However, if some of your source system data is held in indexed or other flat files, this option will not work for those cases. There are no log files for these non-database applications. You will have to apply some other data extraction technique for these cases. While
we are on the topic of data capture through transaction logs, let us take a side excursion and look at the use of
replication. Data replication is simply a method for creating copies of data in a distributed environment. Following
figure illustrates how replication technology can be used to capture changes to source data.
[Figure: data capture through replication – the log/transaction manager reads the DBMS transaction log files of the source operational systems and replicates the changes]
The appropriate transaction logs contain all the changes to the various source database tables. Here are the broad
steps for using replication to capture changes to source data:
• Identify the source system DB table
• Identify and define target files in the staging area
• Create mapping between the source table and target files
• Define the replication mode
• Schedule the replication process
• Capture the changes from the transaction logs
• Transfer captured data from the logs to the target files
• Verify the transfer of data changes
• Confirm the success or failure of the replication
• Document the outcome of the replication in metadata
• Maintain definitions of sources, targets, and mappings
[Figure: deferred data extraction – Option 1: capture based on date and time stamps in the source records (today's extract); Option 2: capture by comparing today's extract with yesterday's extract using file comparison programs; the resulting extract files are moved into the data staging area]
Fig. 2.11 Deferred data extraction
Any intermediary states between two data extraction runs are lost. Deletion of source records presents a special
problem. If a source record gets deleted in between two extract runs, the information about the delete is not detected.
You can get around this by marking the source record for delete first, do the extraction run, and then go ahead and
physically delete the record. This means you have to add more logic to the source applications.
While performing today's data extraction for changes to product data, you do a full file comparison between today's copy of the product data and yesterday's copy. You also compare the record keys to find the inserts and deletes. Then you capture any changes between the two copies.
This technique necessitates the keeping of prior copies of all the relevant source data. Though simple and
straightforward, comparison of full rows in a large file can be very inefficient. However, this may be the only feasible
option for some legacy data sources that do not have transaction logs or time stamps on source records.
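A minimal sketch of the file-comparison technique, assuming each record carries a unique key, might look like the following; it classifies inserts, deletes and updates between yesterday's and today's extracts. The snapshots and keys are invented for the example.

def compare_extracts(yesterday: dict, today: dict):
    """Compare keyed snapshots of a source file and classify the changes.

    Each argument maps a record key to the full record (for example, a tuple
    of field values). Returns the inserted, deleted and updated keys.
    """
    inserted = [k for k in today if k not in yesterday]
    deleted = [k for k in yesterday if k not in today]
    updated = [k for k in today if k in yesterday and today[k] != yesterday[k]]
    return inserted, deleted, updated

# Example usage with two small product snapshots.
yesterday = {"P1": ("bigpart-1", 10.0), "P2": ("bigpart-2", 12.5)}
today = {"P1": ("bigpart-1", 11.0), "P3": ("bigpart-3", 9.0)}
print(compare_extracts(yesterday, today))   # (['P3'], ['P2'], ['P1'])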
Before moving the extracted data from the source systems into the data warehouse, you inevitably have to perform
various kinds of data transformations. You have to transform the data according to standards because they come
from many dissimilar source systems. You have to ensure that after all the data is put together, the combined data
does not violate any business rules.
Consider the data structures and data elements that you need in your data warehouse. Now think about all the
relevant data to be extracted from the source systems. From the variety of source data formats, data values, and
the condition of the data quality, you know that you have to perform several types of transformations to make
the source data suitable for your data warehouse. Transformation of source data encompasses a wide variety of
manipulations to change all the extracted source data into usable information to be stored in the data warehouse.
Many companies underestimate the extent and complexity of the data transformation functions. They start out with
a simple departmental data mart as the pilot project. Almost all of the data for this pilot comes from a single source
application. The data transformation just entails field conversions and some reformatting of the data structures. Do
not make the mistake of taking the data transformation functions too lightly. Be prepared to consider all the different
issues and allocate sufficient time and effort to the task of designing the transformations.
Irrespective of the variety and complexity of the source operational systems, and regardless of the extent of your
data warehouse, you will find that most of your data transformation functions break down into a few basic tasks.
Let us go over these basic tasks so that you can view data transformation from a fundamental perspective. Here is
the set of basic tasks:
• Selection: This takes place at the beginning of the whole process of data transformation. You select either whole records or parts of several records from the source systems. The task of selection usually forms part of the extraction function itself. However, in some cases, the composition of the source structure may not be amenable to selection of the necessary parts during data extraction. In these cases, it is prudent to extract the whole record and then do the selection as part of the transformation function.
• Splitting/joining: This task includes the types of data manipulation you need to perform on the selected parts of source records. Sometimes (uncommonly), you will be splitting the selected parts even further during data transformation. Joining of parts selected from many source systems is more widespread in the data warehouse environment.
• Summarisation: Sometimes you may find that it is not feasible to keep data at the lowest level of detail in your data warehouse. It may be that none of your users ever need data at the lowest granularity for analysis or querying. For example, for a grocery chain, sales data at the lowest level of detail for every transaction at the checkout may not be needed. Storing sales by product by store by day in the data warehouse may be quite adequate. So, in this case, the data transformation function includes summarisation of daily sales by product and by store.
• Enrichment: This task is the rearrangement and simplification of individual fields to make them more useful for the data warehouse environment. You may use one or more fields from the same input record to create a better view of the data for the data warehouse. This principle is extended when one or more fields originate from multiple records, resulting in a single field for the data warehouse.
• Format Revisions: These revisions include changes to the data types and lengths of individual fields. In your source systems, product package types may be indicated by codes and names in which the fields are numeric and text data types. Again, the lengths of the package types may vary among the different source systems. It is wise to standardise and change the data type to text to provide values meaningful to the users.
• Decoding of Fields: This is also a common type of data transformation. When you deal with multiple source systems, you are bound to have the same data items described by a plethora of field values. The classic example is the coding for gender, with one source system using 1 and 2 for male and female and another system using M and F. Also, many legacy systems are notorious for using cryptic codes to represent business values. What do the codes AC, IN, RE, and SU mean in a customer file? You need to decode all such cryptic codes and change these into values that make sense to the users. Change the codes to Active, Inactive, Regular, and Suspended.
• Calculated and Derived Values: The extracted data from the sales system contains sales amounts, sales units, and operating cost estimates by product. You will have to calculate the total cost and the profit margin before data can be stored in the data warehouse. Average daily balances and operating ratios are examples of derived fields.
• Splitting of Single Fields: Earlier legacy systems stored names and addresses of customers and employees in large text fields. The first name, middle initials, and last name were stored as a large text in a single field. Similarly, some earlier systems stored city, state, and Zip Code data together in a single field. You need to store individual components of names and addresses in separate fields in your data warehouse for two reasons. First, you may improve the operating performance by indexing on individual components. Second, your users may need to perform analysis by using individual components such as city, state, and Zip Code.
• Merging of Information: This is not quite the opposite of splitting of single fields. This type of data transformation does not literally mean the merging of several fields to create a single field of data. For example, information about a product may come from different data sources. The product code and description may come from one data source. The relevant package types may be found in another data source. The cost data may be from yet another source. In this case, merging of information denotes the combination of the product code, description, package types, and cost into a single entity.
• Date/Time Conversion: This type relates to representation of date and time in standard formats. For example, the American and the British date formats may be standardised to an international format. The date of October 11, 2000 is written as 10/11/2000 in the U.S. format and as 11/10/2000 in the British format. This date may be standardised to be written as 11 OCT 2000.
• Key Restructuring: While extracting data from your input sources, look at the primary keys of the extracted records. You will have to come up with keys for the fact and dimension tables based on the keys in the extracted records. When choosing keys for your data warehouse database tables, avoid keys with built-in meanings. Transform such keys into generic keys generated by the system itself. This is called key restructuring.
• Deduplication: In many companies, the customer files have several records for the same customer. Mostly, the duplicates are the result of creating additional records by mistake. In your data warehouse, you want to keep a single record for one customer and link all the duplicates in the source systems to this single record. This process is called deduplication of the customer file. Employee files and, sometimes, product master files have this kind of duplication problem.
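A few of these tasks can be illustrated together in one small routine. The sketch below is assumed for illustration only: it decodes a cryptic gender code, splits a combined name field, converts a US-format date to a standard form, and derives a margin value; the field names are hypothetical.

from datetime import datetime

GENDER_CODES = {"1": "Male", "2": "Female", "M": "Male", "F": "Female"}

def transform_record(raw: dict) -> dict:
    """Apply a handful of typical transformation tasks to one source record."""
    first, _, last = raw["customer_name"].partition(" ")       # splitting of a single field
    gender = GENDER_CODES.get(raw["gender_code"], "Unknown")   # decoding of fields
    # Date/time conversion: US mm/dd/yyyy into a standard 'DD MON YYYY' form.
    order_date = datetime.strptime(raw["order_date"], "%m/%d/%Y").strftime("%d %b %Y").upper()
    margin = raw["sale_amount"] - raw["cost_amount"]           # calculated/derived value
    return {"first_name": first, "last_name": last, "gender": gender,
            "order_date": order_date, "margin_dollars": round(margin, 2)}

print(transform_record({"customer_name": "Jane Doe", "gender_code": "2",
                        "order_date": "10/11/2000",
                        "sale_amount": 150.0, "cost_amount": 90.0}))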
[Figure: data integration – source data from mainframe, mini and Unix platforms is combined for the data warehouse]
Integrating the data is the combining of all the relevant operational data into coherent data structures to be made
ready for loading into the data warehouse. You may need to consider data integration and consolidation as a type
of pre-processing before other major transformation routines are applied. You have to standardise the names and data representations and resolve discrepancies in the ways in which the same data is represented in different source systems.
Although time-consuming, many of the data integration tasks can be managed. However, let us go over a couple
of more difficult challenges.
Entity Identification Problem
If you have three different legacy applications developed in your organisation at different times in the past, you
are likely to have three different customer files supporting those systems. One system may be the old order entry
system, the second the customer service support system, and the third the marketing system. Most of the customers
will be common to all three files. The same customer on each of the files may have a unique identification number.
These unique identification numbers for the same customer may not be the same across the three systems. This is a
problem of identification in which you do not know which of the customer records relate to the same customer. But
in the data warehouse you need to keep a single record for each customer. You must be able to get the activities of
the single customer from the various source systems and then match up with the single record to be loaded to the
data warehouse. This is a common but very difficult problem in many enterprises where applications have evolved
over time from the distant past. This type of problem is prevalent where multiple sources exist for the same entities.
Vendors, suppliers, employees, and sometimes products are the kinds of entities that are prone to this type of problem.
In the above example of the three customer files, you have to design complex algorithms to match records from all
the three files and form groups of matching records. No matching algorithm can completely determine the groups.
If the matching criteria are too tight, then some records will escape the groups. On the other hand, if the matching
criteria are too loose, a particular group may include records of more than one customer. You need to get your users
involved in reviewing the exceptions to the automated procedures. You have to weigh the issues relating to your
source systems and decide how to handle the entity identification problem. Every time a data extract function is
performed for your data warehouse, which may be every day, do you pause to resolve the entity identification problem
before loading the data warehouse? How will this affect the availability of the data warehouse to your users? Some
companies, depending on their individual situations, take the option of solving the entity identification problem in
two phases. In the first phase, all records, irrespective of whether they are duplicates or not, are assigned unique
identifiers. The second phase consists of reconciling the duplicates periodically through automatic algorithms and
manual verification.
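One common approach, sketched here purely for illustration, is to score candidate customer records on normalised name and address fields and to route borderline scores for manual review; the field names, weights and thresholds are assumptions, not a prescribed matching algorithm.

from difflib import SequenceMatcher

def normalise(value: str) -> str:
    """Lower-case and strip punctuation and whitespace so records compare fairly."""
    return "".join(ch for ch in value.lower() if ch.isalnum())

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Similarity of two customer records on name and address, between 0.0 and 1.0."""
    name_sim = SequenceMatcher(None, normalise(rec_a["name"]), normalise(rec_b["name"])).ratio()
    addr_sim = SequenceMatcher(None, normalise(rec_a["address"]), normalise(rec_b["address"])).ratio()
    return 0.6 * name_sim + 0.4 * addr_sim

def classify(rec_a: dict, rec_b: dict, accept: float = 0.9, review: float = 0.75) -> str:
    """Tight matches are linked automatically; borderline scores go to manual review."""
    score = match_score(rec_a, rec_b)
    if score >= accept:
        return "same-customer"
    if score >= review:
        return "manual-review"
    return "different-customer"

a = {"name": "J. Doe", "address": "12 Main St, Portland ME"}
b = {"name": "Jane Doe", "address": "12 Main Street, Portland, ME"}
print(classify(a, b))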
When the same data element, such as the product unit cost, is available from two different source systems, you need to know from which system you should get the value for storing in the data warehouse. A straightforward solution is to assign a higher priority to one of the two sources and pick up the product unit cost from that source. Sometimes, a straightforward solution such as this may not sit well with the needs of the data warehouse users. You may have to select from either of the files based on the last update date. Or, in some other instances, your determination of the appropriate source may depend on other related fields.
The methods you may want to adopt depend on some significant factors. If you are considering automating most of
the data transformation functions, first consider if you have the time to select the tools, configure and install them,
train the project team on the tools, and integrate the tools into the data warehouse environment. Data transformation
tools can be expensive. If the scope of your data warehouse is modest, then the project budget may not have room
for transformation tools.
In many cases, a suitable combination of both methods will prove to be effective. Find the proper balance based on
the available time frame and the money in the budget.
Use of automated tools certainly improves efficiency and accuracy. As a data transformation specialist, you just
have to specify the parameters, the data definitions, and the rules to the transformation tool. If your input into the
tool is accurate, then the rest of the work is performed efficiently by the tool. You gain a major advantage from
using a transformation tool because of the recording of metadata by the tool. When you specify the transformation
parameters and rules, these are stored as metadata by the tool. This metadata then becomes part of the overall metadata
component of the data warehouse. It may be shared by other components. When changes occur to transformation
functions because of changes in business rules or data definitions, you just have to enter the changes into the tool.
The metadata for the transformations get automatically adjusted by the tool.
A major disadvantage relates to metadata. Automated tools record their own metadata, but in-house programs have to be designed differently if you need to store and use metadata. Even if the in-house programs record the data transformation metadata initially, each time changes occur to the transformation rules the metadata has to be maintained. This puts an additional burden on the maintenance of the manually coded transformation programs.
The whole process of moving data into the data warehouse repository is referred to in several ways. You must have
heard the phrases applying the data, loading the data, and refreshing the data. For the sake of clarity we will use
the phrases as indicated below:
• Initial Load—populating all the data warehouse tables for the very first time
• Incremental Load—applying ongoing changes as necessary in a periodic manner
• Full Refresh—completely erasing the contents of one or more tables and reloading with fresh data (initial load
is a refresh of all the tables)
As loading the data warehouse may take an inordinate amount of time, loads are generally a cause for great concern. During the loads, the data warehouse has to be offline. You need to find a window of time when the loads may be scheduled without affecting your data warehouse users. Therefore, consider dividing up the whole load process into smaller chunks and populating a few files at a time. This gives you two benefits: you may be able to run the smaller loads in parallel, and you might also be able to keep some parts of the data warehouse up and running while loading the other parts. It is hard to estimate the running times of the loads, especially the initial load or a complete refresh. Do test loads to verify the correctness and to estimate the running times.
2.9 Data Quality
Accuracy is associated with a data element. Consider an entity such as customer. The customer entity has attributes
such as customer name, customer address, customer state, customer lifestyle, and so on. Each occurrence of the
customer entity refers to a single customer. Data accuracy, as it relates to the attributes of the customer entity, means
that the values of the attributes of a single occurrence accurately describe the particular customer. The value of the
customer name for a single occurrence of the customer entity is actually the name of that customer. Data quality
implies data accuracy, but it is much more than that. Most cleansing operations concentrate on data accuracy only.
You need to go beyond data accuracy. If the data is fit for the purpose for which it is intended, we can then say such
data has quality. Therefore, data quality is to be related to the usage for the data item as defined by the users. Does
the data item in an entity reflect exactly what the user is expecting to observe? Does the data item possess fitness of
purpose as defined by the users? If it does, the data item conforms to the standards of data quality.
If the database records conform to the field validation edits, then we generally say that the database records are of
good data quality. But such single field edits alone do not constitute data quality. Data quality in a data warehouse
is not just the quality of individual data items but the quality of the full, integrated system as a whole. It is more
than the data edits on individual fields. For example, while entering data about the customers in an order entry
application, you may also collect the demographics of each customer. The customer demographics are not germane
to the order entry application and, therefore, they are not given too much attention. But you run into problems when
you try to access the customer demographics in the data warehouse: the customer data, as an integrated whole, lacks
data quality.
The common dimensions of data quality are summarised below.
• Accuracy: The value stored in the system for a data element is the right value for that occurrence of the data
element. If you have a customer name and an address stored in a record, then the address is the correct address
for the customer with that name. If you find the quantity ordered as 1000 units in the record for order number
12345678, then that quantity is the accurate quantity for that order.
• Domain Integrity: The data value of an attribute falls in the range of allowable, defined values. The common
example is the allowable values being "male" and "female" for the gender data element.
• Data Type: The value for a data attribute is actually stored as the data type defined for that attribute. When the
data type of the store name field is defined as "text", all instances of that field contain the store name shown in
textual format and not in numeric codes.
• Consistency: The form and content of a data field is the same across multiple source systems. If the product code
for product ABC in one system is 1234, then the code for this product must be 1234 in every source system.
• Redundancy: The same data must not be stored in more than one place in a system. If, for reasons of efficiency,
a data element is intentionally stored in more than one place in a system, then the redundancy must be clearly
identified.
• Completeness: There are no missing values for a given attribute in the system.
• Duplication: Duplication of records in a system is completely resolved. If the product file is known to have
duplicate records, then all the duplicate records for each product are identified and a cross-reference created.
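The following minimal Python sketch shows how three of these dimensions (domain integrity, completeness and duplication) can be checked automatically; pandas is assumed to be available, and the column names and allowable values are purely illustrative.

# Minimal data quality checks on a toy customer table.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "gender":      ["male", "female", "female", "unknown"],
    "state":       ["NY", None, "CA", "TX"],
})

# Domain integrity: gender must fall within the allowable, defined values.
allowed_gender = {"male", "female"}
domain_violations = customers[~customers["gender"].isin(allowed_gender)]

# Completeness: no missing values for a given attribute.
missing_state = customers[customers["state"].isna()]

# Duplication: duplicate records must be identified and cross-referenced.
duplicate_ids = customers[customers["customer_id"].duplicated(keep=False)]

print(len(domain_violations), len(missing_state), len(duplicate_ids))  # 1 1 2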
If the kinds of strategic information made available in a data warehouse were readily available from the source
systems, then we would not really need the warehouse. Data warehousing enables the users to make better strategic
decisions by obtaining data from the source systems and keeping it in a format suitable for querying and analysis.
The closed loop shown in the figure comprises planning marketing campaigns, executing the campaigns, and
assessing the results of the campaigns, with the campaigns then enhanced based on the assessment. The data
warehouse helps both in planning the campaigns and in assessing their results.
Fig. 2.13 Enterprise plan-execute-assess closed loop
Assessment of the results determines the effectiveness of the campaigns. Based on the assessment of the results,
more plans may be made to vary the composition of the campaigns or launch additional ones. The cycle of planning,
executing, and assessing continues.
It is very interesting to note that the data warehouse, with its specialised information potential, fits nicely in this
plan–execute–assess loop. The data warehouse reports on the past and helps to plan the future. Initially, the data
warehouse assists in the planning. Once the plans are executed, the data warehouse is used to assess the effectiveness
of the execution.
Table 2.4 General areas where data warehouse can assist in the planning and assessment phases
Summary
• Data design consists of putting together the data structures. A group of data elements form a data structure.
• Logical data design includes determination of the various data elements that are needed and combination of the
data elements into structures of data. Logical data design also includes establishing the relationships among
the data structures.
• Many case tools are available for data modelling. These tools can be used for creating the logical schema and
the physical schema for specific target database management systems (DBMS).
• Another very useful function found in the case tools is the ability to forward-engineer the model and generate
the schema for the target database system you need to work with.
• Creating the STAR schema is the fundamental data design technique for the data warehouse. It is necessary to
gain a good grasp of this technique.
• Dimensional modelling gets its name from the business dimensions we need to incorporate into the logical data
model.
• The multidimensional information package diagram we have discussed is the foundation for the dimensional
model.
• Dimensional modelling is a technique for conceptualising and visualising data models as a set of measures that
are described by common aspects of the business.
• Source identification encompasses the identification of all the proper data sources. It does not stop with just the
identification of the data sources.
• Business transactions keep changing the data in the source systems.
• Operational data in the source system may be thought of as falling into two broad categories.
• Irrespective of the variety and complexity of the source operational systems, and regardless of the extent of
your data warehouse, you will find that most of your data transformation functions break down into a few basic
tasks.
• The whole process of moving data into the data warehouse repository is referred to in several ways. You must
have heard the phrases applying the data, loading the data, and refreshing the data.
References
• Han, J. and Kamber, M., 2006. Data Mining: Concepts and Techniques, 2nd ed., Diane Cerra.
• Kimball, R., 2006. The Data warehouse Lifecycle Toolkit, Wiley-India.
• Mento, B. and Rapple, B., 2003. Data Mining and Warehousing [Online] Available at: <http://www.arl.org/
bm~doc/spec274webbook.pdf>. [Accessed 9 September 2011].
• Orli, R and Santos, F., 1996. Data Extraction, Transformation, and Migration Tools [Online] Available at:
<http://www.kismeta.com/extract.html>. [Accessed 9 September 2011].
• Learndatavault, 2009. Business Data Warehouse (BDW) [Video Online] Available at: <http://www.youtube.
com/watch?v=OjIqP9si1LA&feature=related>. [Accessed 12 September 2011].
• SQLUSA, 2009. SQLUSA.com Data Warehouse and OLAP [Video Online] Available at: <http://www.youtube.
com/watch?v=OJb93PTHsHo>. [Accessed 12 September 2011].
Recommended Reading
• Prabhu, C. S. R., 2004. Data warehousing: concepts, techniques, products and applications, 2nd ed., PHI
Learning Pvt. Ltd.
• Ponniah, P., 2001. Data Warehousing Fundamentals-A Comprehensive Guide for IT Professionals, Wiley-
Interscience Publication.
• Ponniah, P., 2010. Data Warehousing Fundamentals for IT Professionals, 2nd ed., John Wiley and Sons.
Self Assessment
1. __________ consists of putting together the data structures.
a. Data mining
b. Data design
c. Data warehousing
d. Metadata
3. Which of the following is used to define the tables, the attributes and the relationships?
a. Metadata
b. Data warehousing
c. Data design
d. Case tools
4. Creating the __________ is the fundamental data design technique for the data warehouse.
a. STAR schema
b. Data transformation
c. Dimensional modelling
d. Data extraction
5. Each row in a dimension table is identified by a unique value of an attribute designated as the ___________of
the dimension.
a. ordinary key
b. primary key
c. surrogate key
d. foreign key
6. How many general principles are to be applied when choosing primary keys for dimension tables?
a. One
b. Two
c. Three
d. Four
7. Which of the following keys are simply system-generated sequence numbers?
a. Ordinary key
b. Primary key
c. Surrogate key
d. Foreign key
8. Which of the following is a logical design technique to structure the business dimensions and the metrics that
are analysed along these dimensions?
a. Mapping
b. Data extraction
c. Dimensional modelling
d. E-R modelling
9. Which technique is adopted to create the data models for these systems?
a. E-R modelling
b. Dimensional modelling
c. Source identification
d. Data extraction
Chapter III
Data Mining
Aim
The aim of this chapter is to:
Objectives
The objectives of this chapter are to:
Learning outcome
At the end of this chapter, you will be able to:
3.1 Introduction
Data mining refers to the process of finding interesting patterns in data that are not explicitly part of the data. The
interesting patterns can be used to make predictions. The process of data mining is composed of several steps
including selecting data to analyse, preparing the data, applying the data mining algorithms, and then interpreting
and evaluating the results. Sometimes the term data mining refers only to the step in which the data mining algorithms
are applied. This has created a fair amount of confusion in the literature. More often, however, the term is used to
refer to the entire process of finding and using interesting patterns in data.
Data mining techniques were first applied to databases. A better term for this process is KDD
(Knowledge Discovery in Databases). Benoît (2002) offers this definition of KDD (which he refers to as data
mining): Data mining (DM) is a multistage process of extracting previously unanticipated knowledge from large
mining): Data mining (DM) is a multistage process of extracting previously unanticipated knowledge from large
databases, and applying the results to decision making. Data mining tools detect patterns from the data and infer
associations and rules from them. The extracted information may then be applied to prediction or classification
models by identifying relations within the data records or between databases. Those patterns and rules can then
guide decision making and forecast the effects of those decisions.
Data mining techniques can be applied to a wide variety of data repositories including databases, data warehouses,
spatial data, multimedia data, Internet or Web-based data and complex objects. A more appropriate term for describing
the entire process would be knowledge discovery, but unfortunately the term data mining is what has caught on.
The following figure shows data mining as a step in an iterative knowledge discovery process.
The figure traces the flow from databases through data integration and data cleaning into a data warehouse, then
through selection and transformation to task-relevant data, on which data mining is performed, followed by pattern
evaluation and, finally, knowledge.
The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some
form of new knowledge. The iterative process consists of the following steps:
• Data cleaning: This is also known as data cleansing. This is a phase in which noise data and irrelevant data are
removed from the collection.
• Data integration: At this stage, multiple data sources, often heterogeneous, may be combined in a common
source.
• Data selection: At this step, the data relevant to the analysis is decided on and retrieved from the data
collection.
• Data transformation: This is also known as data consolidation, it is a phase in which the selected data is
transformed into forms appropriate for the mining procedure.
• Data mining: It is the crucial step in which clever techniques are applied to extract patterns potentially useful.
• Pattern evaluation: In this step, strictly interesting patterns representing knowledge are identified based on
given measures.
• Knowledge representation: This is the final phase in which the discovered knowledge is visually represented to
the user. This essential step uses visualisation techniques to help users understand and interpret the data mining
results.
It is common to combine some of these steps together. For instance, data cleaning and data integration can be
performed together as a pre-processing phase to generate a data warehouse. Data selection and data transformation
can also be combined where the consolidation of the data is the result of the selection, or, as for the case of data
warehouses, the selection is done on transformed data.
The KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation measures
can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data
sources can be integrated, in order to get different, more appropriate results.
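The following minimal Python sketch walks through these steps on a toy data set; pandas and scikit-learn are assumed to be available, and the data, column names and choice of mining algorithm are purely illustrative.

# A small KDD-style pipeline: integrate, clean, select, transform, mine, present.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Data integration and cleaning: combine two sources and drop incomplete rows.
orders = pd.DataFrame({"cust_id": [1, 2, 3, 4], "amount": [120.0, 80.0, None, 400.0]})
profiles = pd.DataFrame({"cust_id": [1, 2, 3, 4], "visits": [10, 3, 7, 25]})
data = orders.merge(profiles, on="cust_id").dropna()

# Data selection and transformation: keep task-relevant columns and normalise them.
features = data[["amount", "visits"]]
scaled = MinMaxScaler().fit_transform(features)

# Data mining: apply a mining algorithm (here, clustering) to extract patterns.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

# Pattern evaluation / knowledge representation: inspect and present the result.
data["segment"] = labels
print(data)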
Data mining derives its name from the similarities between searching for valuable information in a large database and
mining rocks for a vein of valuable ore. Both imply either sifting through a large amount of material or ingeniously
probing the material to exactly pinpoint where the values reside. The term is, however, a misnomer, since mining for gold in
rocks is usually called "gold mining" and not "rock mining"; thus, by analogy, data mining should have been called
"knowledge mining" instead. Nevertheless, data mining became the accepted customary term, and it very rapidly became
a trend that even overshadowed more general terms such as knowledge discovery in databases (KDD), which describe
a more complete process. Other similar terms referring to data mining are: data dredging, knowledge extraction
and pattern discovery.
The ongoing remarkable growth in the field of data mining and knowledge discovery has been fuelled by a fortunate
confluence of a variety of factors:
• The explosive growth in data collection, as exemplified by the supermarket scanners above
• The storing of the data in data warehouses, so that the entire enterprise has access to a reliable current
database
• The availability of increased access to data from Web navigation and intranets
• The competitive pressure to increase market share in a globalised economy
• The development of off-the-shelf commercial data mining software suites
• The tremendous growth in computing power and storage capacity
trees found in the different samples, and to apply some simple voting. The final classification is the one most often
predicted by the different trees. Note that some weighted combination of predictions (weighted vote, weighted
average) is also possible, and commonly used. A sophisticated (machine learning) algorithm for generating weights
for weighted prediction or voting is the Boosting procedure.
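The following minimal Python sketch illustrates the idea described above: several trees are trained on bootstrap samples and the final class is the one predicted most often. scikit-learn and numpy are assumed to be available, and the data set is only a toy example.

# Simple (unweighted) voting over trees built on bootstrap samples.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample of the training data
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

# The final classification is the class most often predicted by the different trees.
all_preds = np.array([t.predict(X) for t in trees])                    # shape (25, n)
votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
print("training accuracy of the vote:", (votes == y).mean())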
3.2.2 Boosting
The concept of boosting applies to the area of predictive data mining, to generate multiple models or classifiers
(for prediction or classification), and to derive weights to combine the predictions from those models into a single
prediction or predicted classification.
A simple algorithm for boosting works like this: start by applying some method (for example, a tree classifier
such as C&RT or CHAID) to the learning data, where each observation is assigned an equal weight. Compute the
predicted classifications, and apply weights to the observations in the learning sample that are inversely proportional
to the accuracy of the classification. In other words, assign greater weight to those observations that were difficult
to classify (where the misclassification rate was high), and lower weights to those that were easy to classify (where
the misclassification rate was low). In the context of C&RT for example, different misclassification costs (for the
different classes) can be applied, inversely proportional to the accuracy of prediction in each class. Then apply the
classifier again to the weighted data (or with different misclassification costs), and continue with the next iteration
(application of the analysis method for classification to the re-weighted data).
Boosting will generate a sequence of classifiers, where each consecutive classifier in the sequence is an “expert”
in classifying observations that were not well classified by those preceding it. During deployment (for prediction
or classification of new cases), the predictions from the different classifiers can then be combined (example, via
voting, or some weighted voting procedure) to derive a single best prediction or classification.
Note that boosting can also be applied to learning methods that do not explicitly support weights or misclassification
costs. In that case, random sub-sampling can be applied to the learning data in the successive steps of the iterative
boosting procedure, where the probability for selection of an observation into the subsample is inversely proportional
to the accuracy of the prediction for that observation in the previous iteration (in the sequence of iterations of the
boosting procedure).
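The following minimal Python sketch mirrors the re-weighting idea described above; it is not a full implementation of any particular boosting algorithm. scikit-learn and numpy are assumed, and the doubling of weights for misclassified observations is an illustrative choice.

# Boosting-style re-weighting: harder-to-classify observations get larger weights.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
weights = np.full(len(X), 1.0 / len(X))      # each observation starts with an equal weight
classifiers = []

for _ in range(5):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
    classifiers.append(stump)
    misclassified = stump.predict(X) != y
    weights[misclassified] *= 2.0            # assign greater weight to difficult observations
    weights /= weights.sum()                 # renormalise the weights

# Combine the sequence of classifiers by simple voting.
preds = np.array([c.predict(X) for c in classifiers])
vote = (preds.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the boosted vote:", (vote == y).mean())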
3.2.5 Deployment
The concept of deployment in predictive data mining refers to the application of a model for prediction or classification
to new data. After a satisfactory model or set of models has been identified (trained) for a particular application, we
usually want to deploy those models so that predictions or predicted classifications can quickly be obtained for new
data. For example, a credit card company may want to deploy a trained model or set of models (for example, neural
networks, meta-learner) to quickly identify transactions which have a high probability of being fraudulent.
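The following minimal Python sketch shows deployment in its simplest form: a trained model is saved and later reloaded to score new records. scikit-learn and joblib are assumed to be available; the fraud scenario, file name and feature values are purely illustrative.

# Train once, persist the model, and reload it later to score new transactions.
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Training time: fit on historical, labelled transactions (toy data here).
X_train = np.array([[20.0, 1], [15.0, 1], [900.0, 0], [1200.0, 0]])
y_train = np.array([0, 0, 1, 1])             # 1 = fraudulent
model = LogisticRegression().fit(X_train, y_train)
joblib.dump(model, "fraud_model.joblib")

# Deployment time: load the trained model and quickly score new transactions.
deployed = joblib.load("fraud_model.joblib")
new_transactions = np.array([[25.0, 1], [1100.0, 0]])
print(deployed.predict_proba(new_transactions)[:, 1])   # probability of fraud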
Feature selection selects a subset of predictors from a large list of candidate predictors without assuming that the
relationships between the predictors and the dependent or outcome variables of interest are linear, or even monotone.
Therefore, this is used as a pre-processor for predictive data mining, to select manageable sets of predictors that are
likely related to the dependent (outcome) variables of interest, for further analyses with any of the other methods
for regression and classification.
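The following minimal Python sketch uses mutual information as the selection criterion because, like the description above, it does not presuppose a linear or monotone relationship between the predictors and the outcome; scikit-learn is assumed, and the choice of k is purely illustrative.

# Feature selection as a pre-processor: keep a manageable set of predictors.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)
selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
X_reduced = selector.transform(X)
print("kept predictors:", selector.get_support(indices=True), X_reduced.shape)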
3.2.9 Meta-Learning
The concept of meta-learning applies to the area of predictive data mining, to combine the predictions from multiple
models. It is particularly useful when the types of models included in the project are very different. In this context,
this procedure is also referred to as Stacking (Stacked Generalisation).
One such model, CRISP (Cross-Industry Standard Process for data mining) was proposed in the mid-1990s by a
European consortium of companies to serve as a non-proprietary standard process model for data mining. This
general approach postulates the following (perhaps not particularly controversial) general sequence of steps for
data mining projects:
• Business (research) understanding
• Data understanding
• Data preparation
• Modelling
• Evaluation
• Deployment
Another approach - the Six Sigma methodology - is a well-structured, data-driven methodology for eliminating
defects, waste, or quality control problems of all kinds in manufacturing, service delivery, management, and other
business activities. This model has recently become very popular (due to its successful implementations) in various
American industries, and it appears to be gaining favour worldwide. It postulates a sequence of so-called DMAIC steps
(Define, Measure, Analyse, Improve, Control) that grew out of the manufacturing, quality improvement, and process
control traditions and is particularly well suited to production environments (including "production of services,"
that is, service industries).
Another framework of this kind (actually somewhat similar to Six Sigma) is the approach proposed by SAS Institute
called SEMMA (Sample, Explore, Modify, Model, Assess), which focuses more on the technical activities typically
involved in a data mining project.
All of these models are concerned with the process of how to integrate data mining methodology into an organisation,
how to “convert data into information,” how to involve important stake-holders, and how to disseminate the
information in a form that can easily be converted by stake-holders into resources for strategic decision making.
Some software tools for data mining are specifically designed and documented to fit into one of these specific
frameworks.
According to CRISP–DM, a given data mining project has a life cycle consisting of six phases, as illustrated in
the fig. 3.5. Note that the phase sequence is adaptive. That is, the next phase in the sequence often depends on the
outcomes associated with the preceding phase. The most significant dependencies between phases are indicated by
the arrows. For example, suppose that we are in the modelling phase. Depending on the behaviour and characteristics
of the model, we may have to return to the data preparation phase for further refinement before moving forward to
the model evaluation phase.
The iterative nature of CRISP is symbolised by the outer circle in the figure 3.5. Often, the solution to a particular
business or research problem leads to further questions of interest, which may then be attacked using the same
general process as before.
Fig. 3.5 illustrates the CRISP–DM life cycle with its six phases: business/research understanding, data understanding,
data preparation, modelling, evaluation, and deployment.
Following is an outline of each phase. Although conceivably, issues encountered during the evaluation phase can
send the analyst back to any of the previous phases for amelioration, for simplicity we show only the most common
loop, back to the modelling phase.
The phases and their explanations are outlined below.
Business understanding phase: The first phase in the CRISP–DM standard process may also be termed the research
understanding phase.
• Enunciate the project objectives and requirements clearly in terms of the business or research unit as a whole.
• Translate these goals and restrictions into the formulation of a data mining problem definition.
• Prepare a preliminary strategy for achieving these objectives.
Deployment phase:
• Make use of the models created: model creation does not signify the completion of a project.
• Example of a simple deployment: generate a report.
• Example of a more complex deployment: implement a parallel data mining process in another department.
• For businesses, the customer often carries out the deployment based on your model.
As such networks have been studied extensively in the context of social networks, their analysis has often been
referred to as social network analysis. Furthermore, in a relational database, objects are semantically linked across
multiple relations. Mining in a relational database often requires mining across multiple interconnected relations,
which is similar to mining in connected graphs or networks. Such kind of mining across data relations is considered
multirelational data mining. Data mining techniques in this area can be grouped, as shown in the following diagram,
into graph mining, social network analysis, and multirelational data mining.
Graph mining is an active and important theme in data mining. Among the various kinds of graph patterns, frequent substructures are
the very basic patterns that can be discovered in a collection of graphs. They are useful for characterising graph sets,
discriminating different groups of graphs, classifying and clustering graphs, building graph indices, and facilitating
similarity search in graph databases.
Recent studies have developed several graph mining methods and applied them to the discovery of interesting
patterns in various applications. For example, there are reports on the discovery of active chemical structures in
HIV-screening datasets by contrasting the support of frequent graphs between different classes. There have been
studies on the use of frequent structures as features to classify chemical compounds, on the frequent graph mining
technique to study protein structural families, on the detection of considerably large frequent subpathways in metabolic
networks, and on the use of frequent graph patterns for graph indexing and similarity search in graph databases.
Although graph mining may include mining frequent subgraph patterns, graph classification, clustering, and other
analysis tasks, in this section we focus on mining frequent subgraphs. The methods for mining frequent subgraphs
fall into two broad groups: the Apriori-based approach and the pattern-growth approach.
Examples include electrical power grids, telephone call graphs, the spread of computer viruses, the World Wide
Web, and co-authorship and citation networks of scientists. Customer networks and collaborative filtering problems
(where product recommendations are based on the preferences of other customers) are other examples. In biology,
examples range from epidemiological networks, cellular and metabolic networks, and food webs, to the neural
network of the nematode worm Caenorhabditis elegans (the only creature whose neural network has been completely
mapped). The exchange of e-mail messages within corporations, newsgroups, chat rooms, friendships, sex webs
(linking sexual partners), and the quintessential “old-boy” network (that is the overlapping boards of directors of
the largest companies in the United States) are examples from sociology.
• Densification power law: Previously, it was believed that as a network evolves, the number of degrees grows
linearly in the number of nodes. This was known as the constant average degree assumption. However, extensive
experiments have shown that, on the contrary, networks become increasingly dense over time with the average
degree increasing (and hence, the number of edges growing super linearly in the number of nodes). The
densification follows the densification power law (or growth power law), which states that e(t) ∝ n(t)^a, where e(t)
and n(t), respectively, represent the number of edges and nodes of the graph at time t, and the exponent a generally
lies strictly between 1 and 2. Note that if a = 1, this corresponds to constant average degree over time, whereas
a = 2 corresponds to an extremely dense graph where each node has edges to a constant fraction of all nodes.
• Shrinking diameter: It has been experimentally shown that the effective diameter tends to decrease as the
network grows. This contradicts an earlier belief that the diameter slowly increases as a function of network
size. As an intuitive example, consider a citation network, where nodes are papers and a citation from
one paper to another is indicated by a directed edge. The out-links of a node, v (representing the papers cited by
v), are "frozen" at the moment it joins the graph. The decreasing distances between pairs of nodes consequently
appear to be the result of subsequent papers acting as "bridges" by citing earlier papers from other areas.
• Heavy-tailed out-degree and in-degree distributions: The number of out-degrees for a node tends to follow
a heavy-tailed distribution, observing the power law 1/n^a, where n is the rank of the node in the order of
decreasing out-degrees and, typically, 0 < a < 2. The smaller the value of a, the heavier the tail. This phenomenon
is captured by the preferential attachment model, where each new node attaches to an existing network by
a constant number of out-links, following a "rich-get-richer" rule. The in-degrees also follow a heavy-tailed
distribution, although it tends to be more skewed than the out-degrees distribution.
Fig. 3.8 Heavy-tailed out-degree and in-degree distributions
The number of out-degrees (y-axis) for a node tends to follow a heavy-tailed distribution. The node rank (x-axis)
is defined as the order of decreasing out-degrees of the node.
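The following minimal Python sketch generates a graph with the "rich-get-richer" preferential attachment model mentioned above and inspects its heavy-tailed degree distribution; the networkx library is assumed, and because this generator produces an undirected graph, plain degrees stand in for out-degrees.

# Preferential attachment: each new node attaches to a constant number of existing nodes.
import networkx as nx

G = nx.barabasi_albert_graph(n=10_000, m=3, seed=42)    # undirected variant of the model
degrees = sorted((d for _, d in G.degree()), reverse=True)
# The top-ranked nodes have degrees far above the average, as a heavy tail implies.
print("top 5 degrees:", degrees[:5], " average degree:", sum(degrees) / len(degrees))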
In link prediction, a proximity score, score(X, Y), is computed for each pair of nodes X and Y, based
on the given proximity measure and input graph, G. A ranked list in decreasing order of score(X, Y) is produced.
This gives the predicted new links in decreasing order of confidence. The predictions can be evaluated based on real
observations on experimental data sets. The simplest approach ranks pairs, (X, Y), by the length of their shortest path
in G. This embodies the small world notion that all individuals are linked through short chains. (Since the convention
is to rank all pairs in order of decreasing score, here score(X, Y) is defined as the negative of the shortest path
length.) Several measures use neighbourhood information. The simplest such measure is common neighbours—the
greater the number of neighbours that X and Y have in common, the more likely X and Y are to form a link in the
future. Intuitively, if authors X and Y have never written a paper together but have many colleagues in common, they
are more likely to collaborate in the future. Other measures are based on the ensemble of all paths between two
nodes. The Katz measure, for example, computes a weighted sum over all paths between X and Y, where shorter
paths are assigned heavier weights. All of the measures can be used in conjunction with higher-level approaches,
such as clustering. For instance, the link prediction method can be applied to a cleaned-up version of the graph, in
which spurious edges have been removed.
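The following minimal Python sketch ranks non-adjacent node pairs by the common-neighbours score described above; networkx is assumed, and the toy graph is purely illustrative.

# Common-neighbours link prediction on a small example graph.
import networkx as nx
from itertools import combinations

G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")])

scores = []
for x, y in combinations(G.nodes(), 2):
    if not G.has_edge(x, y):
        score = len(list(nx.common_neighbors(G, x, y)))   # number of shared neighbours
        scores.append(((x, y), score))

# Ranked list in decreasing order of score(X, Y): the top pairs are the predicted links.
scores.sort(key=lambda item: item[1], reverse=True)
print(scores)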
For example, consider a person who decides to see a particular movie and persuades a group of friends to see the
same film. Viral marketing aims to optimise the positive word-of-mouth effect among customers. It can choose to
spend more money marketing to an individual, if that person has many social connections. Thus, by considering
the interactions between customers, viral marketing may obtain higher profits than traditional marketing, which
ignores such interactions.
The growth of the Internet over the past two decades has led to the availability of many social networks that can be
mined for the purposes of viral marketing. Examples include e-mail mailing lists, UseNet groups, on-line forums,
instant relay chat (IRC), instant messaging, collaborative filtering systems, and knowledge-sharing sites. Knowledge
sharing sites allow users to offer advice or rate products to help others, typically for free. Users can rate the usefulness
or “trustworthiness” of a review, and may possibly rate other reviewers as well. In this way, a network of trust
relationships between users (known as a “web of trust”) evolves, representing a social network for mining.
An interesting phenomenon is that people more frequently respond to a message when they disagree than when they
agree. This behaviour exists in many newsgroups and is in sharp contrast to the Web page link graph, where linkage
is an indicator of agreement or common interest. Based on this behaviour, one can effectively classify and partition
authors in the newsgroup into opposite camps by analysing the graph structure of the responses.
This newsgroup classification process can be performed using a graph-theoretic approach. The quotation network
(or graph) can be constructed by building a quotation link between person i and person j if i has quoted from an
earlier posting written by j. We can consider any bipartition of the vertices into two sets: F represents those for
an issue and A represents those against it. If most edges in a newsgroup graph represent disagreements, then the
optimum choice is to maximise the number of edges across these two sets. Because it is known that theoretically the
max-cut problem (that is maximising the number of edges to cut so that a graph is partitioned into two disconnected
subgraphs) is an NP-hard problem, we need to explore some alternative, practical solutions.
In particular, we can exploit two additional facts that hold in our situation: (1) rather than being a general graph, our
instance is largely a bipartite graph with some noise edges added, and (2) neither side of the bipartite graph is much
smaller than the other. In such situations, we can transform the problem into a minimum-weight, approximately
balanced cut problem, which in turn can be well approximated by computationally simple spectral methods. Moreover,
to further enhance the classification accuracy, we can first manually categorise a small number of prolific posters
and tag the corresponding vertices in the graph. This information can then be used to bootstrap a better overall
partitioning by enforcing the constraint that those classified on one side by human effort should remain on that side
during the algorithmic partitioning of the graph.
Based on these ideas, an efficient algorithm was proposed. Experiments with some newsgroup data sets on several
highly debatable social topics, such as abortion, gun control, and immigration, demonstrate that links carry less
noisy information than text. Methods based on linguistic and statistical analysis of text yield lower accuracy on
such newsgroup data sets than that based on the link analysis shown earlier. This is because the vocabulary used by
the opponent sides tends to be largely identical, and many newsgroup postings consist of too-brief text to facilitate
reliable linguistic analysis.
For example, in Web page linkage, two Web pages (objects) are related if there is a hyperlink between them. A graph
of Web page linkages can be mined to identify a community or set of Web pages on a particular topic.
Most techniques for graph mining and community mining are based on a homogenous graph, that is, they assume
that only one kind of relationship exists between the objects. However, in real social networks, there are always
various kinds of relationships between the objects. Each relation can be viewed as a relation network. In this sense,
the multiple relations form a multirelational social network (also referred to as a heterogeneous social network).
Each kind of relation may play a distinct role in a particular task. Here, the different relation graphs can provide us
with different communities. To find a community with certain properties, it is necessary to identify, which relation
plays an important role in such a community. Such a relation might not exist explicitly; that is, we may need to first
discover such a hidden relation before finding the community on such a relation network.
Different users may be interested in different relations within a network. Thus, if we mine networks by assuming
only one kind of relation, we may end up missing out on a lot of valuable hidden community information, and
such mining may not be adaptable to the diverse information needs of various users. This brings us to the problem
of multirelational community mining, which involves the mining of hidden communities on heterogeneous social
networks.
There are different multirelational data mining tasks, including multirelational classification, clustering, and frequent
pattern mining. Multirelational classification aims to build a classification model that utilises information in different
relations. Multirelational clustering aims to group tuples into clusters using their own attributes as well as tuples
related to them in different relations. Multirelational frequent pattern mining aims at finding patterns involving
interconnected items in different relations.
In a database for multirelational classification, there is one target relation, Rt , whose tuples are called target tuples
and are associated with class labels. The other relations are non-target relations. Each relation may have one primary
key (which uniquely identifies tuples in the relation) and several foreign keys (where a primary key in one relation
can be linked to the foreign key in another). If we assume a two-class problem, then we pick one class as the positive
class and the other as the negative class. The most important task for building an accurate multirelational classifier
is to find relevant features in different relations that help distinguish positive and negative target tuples.
The mining model that an algorithm creates can take various forms, including:
• A set of rules that describe how products are grouped together in a transaction.
• A decision tree that predicts whether a particular customer will buy a product.
• A mathematical model that forecasts sales.
• A set of clusters that describe how the cases in a dataset are related.
3.8.1 Classification
Classification is the task of generalising known structure to apply to new data. For example, an email program might
attempt to classify an email as legitimate or spam.
Preparing data for classification: The following pre-processing steps may be applied to the data to help improve
the accuracy, efficiency, and scalability of the classification process.
Data cleaning: This refers to the pre-processing of data in order to remove or reduce noise (by applying smoothing
techniques, for example) and the treatment of missing values (for example, by replacing a missing value with the
most commonly occurring value for that attribute, or with the most probable value based on statistics). Although
most classification algorithms have some mechanisms for handling noisy or missing data, this step can help reduce
confusion during learning.
Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can be used to identify
whether any two given attributes are statistically related. For example, a strong correlation between attributes A1
and A2 would suggest that one of the two could be removed from further analysis. A database may also contain
irrelevant attributes. Attribute subset selection can be used in these cases to find a reduced set of attributes such that the
resulting probability distribution of the data classes is as close as possible to the original distribution obtained using
all attributes. Hence, relevance analysis, in the form of correlation analysis and attribute subset selection, can be
used to detect attributes that do not contribute to the classification or prediction task. Including such attributes may
otherwise slow down, and possibly mislead, the learning step.
Ideally, the time spent on relevance analysis, when added to the time spent on learning from the resulting “reduced”
attribute (or feature) subset, should be less than the time that would have been spent on learning from the original
set of attributes. Hence, such analysis can help to improve classification efficiency and scalability.
Data transformation and reduction: The data may be transformed by normalisation, particularly when neural
networks or methods involving distance measurements are used in the learning step. Normalisation involves scaling
all values for a given attribute so that they fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0. In
methods that use distance measurements, for example, this would prevent attributes with initially large ranges (for
example, income) from outweighing attributes with initially smaller ranges (such as binary attributes). The data can
also be transformed by generalising it to higher-level concepts. Concept hierarchies may be used for this purpose.
This is particularly useful for continuous valued attributes. For example, numeric values for the attribute income
can be generalised to discrete ranges, such as low, medium, and high. Similarly, categorical attributes, like street,
can be generalised to higher-level concepts, like city. Because generalisation compresses the original training data,
fewer input/output operations may be involved during learning. Data can also be reduced by applying many other
methods, ranging from wavelet transformation and principal components analysis to discretisation techniques, such
as binning, histogram analysis, and clustering.
Bayesian classification: Bayesian classifiers are statistical classifiers. They can predict class membership probabilities,
such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes’
theorem. Studies comparing classification algorithms have found a simple Bayesian classifier known as the naive
Bayesian classifier to be comparable in performance with decision tree and selected neural network classifiers.
Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases. Naïve Bayesian
classifiers assume that the effect of an attribute value on a given class is independent of the values of the other
attributes. This assumption is called class conditional independence. It is made to simplify the computations involved
and, in this sense, is considered “naïve.” Bayesian belief networks are graphical models, which unlike naïve Bayesian
classifiers allow the representation of dependencies among subsets of attributes. Bayesian belief networks can also
be used for classification.
Bayes’ theorem
Bayes’ theorem is named after Thomas Bayes, a nonconformist English clergyman who performed early work in
probability and decision theory during the 18th century. Let X be a data tuple. In Bayesian terms, X is considered
“evidence.” As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis, such
as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X),
the probability that the hypothesis H holds given the "evidence" or observed data tuple X. In other words, we are
looking for the probability that tuple X belongs to class C, given that we know the attribute description of X. P(H|X)
is the posterior probability, or a posteriori probability, of H conditioned on X.
For example, suppose our world of data tuples is confined to customers described by the attributes age and income,
respectively, and that X is a 35-year-old customer with an income of $40,000. Suppose that H is the hypothesis that
our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given
that we know the customer’s age and income.
In contrast, P(H) is the prior probability, or a priori probability, of H. For the above example, this is the probability
that any given customer will buy a computer, regardless of age, income, or any other information, for that matter.
The posterior probability, P(H|X), is based on more information (for example, customer information) than the prior
probability, P(H), which is independent of X.
Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that a customer,
X, is 35 years old and earns $40,000, given that we know the customer will buy a computer. P(X) is the prior
probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old
and earns $40,000.
"How are these probabilities estimated?" P(H), P(X|H), and P(X) may be estimated from the given data, as we
shall see below. Bayes' theorem is useful in that it provides a way of calculating the posterior probability, P(H|X),
from P(H), P(X|H), and P(X).
Bayes’ theorem is
for
Thus, we maximise . The classCi for which is maximised is called the maximum posteriori
hypothesis according to the Bayes’ theorem.
• As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximised. If the class prior probabilities
are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = ... =
P(Cm), and we would therefore maximise P(X|Ci). Otherwise, we maximise P(X|Ci) P(Ci). Note that the
class prior probabilities may be estimated by P(Ci) = |Ci,D| / |D|, where |Ci,D| is the number of training tuples
of class Ci in D.
• Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). In
order to reduce computation in evaluating P(X|Ci), the naive assumption of class conditional independence is
made. This presumes that the values of the attributes are conditionally independent of one another, given the
class label of the tuple (that is, there are no dependence relationships among the attributes). Thus,
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) from the training tuples. Recall
that here xk refers to the value of attribute Ak for tuple X. For each attribute, we look at whether the attribute is
categorical or continuous-valued. For instance, to compute P(xk|Ci), we consider the following:
• If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk for Ak, divided
by |Ci,D|, the number of tuples of class Ci in D.
• If Ak is continuous-valued, then we need to do a bit more work, but the calculation is pretty straightforward. A
continuous-valued attribute is typically assumed to have a Gaussian distribution with a mean μ and standard
deviation σ, defined by
g(x, μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))
so that
P(xk|Ci) = g(xk, μCi, σCi)
These equations may appear daunting, but we need to compute only μCi and σCi, which are the mean (that is, average)
and standard deviation, respectively, of the values of attribute Ak for training tuples of class Ci. We then plug these
two quantities into the first equation above, together with xk, in order to estimate P(xk|Ci).
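The following minimal Python sketch applies these estimates to a single continuous-valued attribute: for each class the mean and standard deviation are computed and plugged into the Gaussian density, and the class with the largest P(Ci) P(x|Ci) is chosen. numpy is assumed, and the toy data are purely illustrative.

# Naive Bayesian classification with one Gaussian-modelled continuous attribute.
import numpy as np

def gaussian(x, mu, sigma):
    return (1.0 / (np.sqrt(2 * np.pi) * sigma)) * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Toy training data: a continuous attribute (age) and a class label.
age = np.array([25, 30, 35, 28, 60, 55, 52, 58], dtype=float)
label = np.array(["buys", "buys", "buys", "buys", "no", "no", "no", "no"])

x_new = 33.0
posteriors = {}
for ci in np.unique(label):
    values = age[label == ci]
    prior = len(values) / len(age)                                     # P(Ci)
    likelihood = gaussian(x_new, values.mean(), values.std(ddof=1))    # P(x_new | Ci)
    posteriors[ci] = prior * likelihood                                # proportional to P(Ci | x_new)

print(max(posteriors, key=posteriors.get), posteriors)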
Rule-Based Classification
In rule-based classifiers, the learned model is represented as a set of IF-THEN rules.
An IF-THEN rule is an expression of the form IF condition THEN conclusion, for example:
R1: IF age = youth AND student = yes THEN buys_computer = yes
The "IF" part (or left-hand side) of the rule is the rule antecedent or precondition, and the "THEN" part (or
right-hand side) is the rule consequent. In the rule antecedent, the condition consists of one or more attribute tests
(such as age = youth, and student = yes) that are logically ANDed. The rule's consequent contains a class prediction
(in this case, we are predicting whether a customer will buy a computer). R1 can also be written as
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a given tuple, we say that the
rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the tuple.
A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class labelled data set, D, let ncovers
be the number of tuples covered by R; ncorrect be the number of tuples correctly classified by R; and |D| be the number
of tuples in D. We can define the coverage and accuracy of R as
coverage(R) = ncovers / |D|
accuracy(R) = ncorrect / ncovers
That is, a rule’s coverage is the percentage of tuples that are covered by the rule (that is whose attribute values hold
true for the rule’s antecedent). For a rule’s accuracy, we look at the tuples that it covers and see what percentage of
them the rule can correctly classify.
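The following minimal Python sketch computes the coverage and accuracy of rule R1 from above over a purely illustrative toy data set.

# Coverage and accuracy of a single IF-THEN rule.
D = [
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "youth",  "student": "yes", "buys_computer": "no"},
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
]

def rule_antecedent(t):
    # R1: IF age = youth AND student = yes THEN buys_computer = yes
    return t["age"] == "youth" and t["student"] == "yes"

covered = [t for t in D if rule_antecedent(t)]                    # ncovers tuples
correct = [t for t in covered if t["buys_computer"] == "yes"]     # ncorrect tuples

coverage = len(covered) / len(D)           # ncovers / |D|
accuracy = len(correct) / len(covered)     # ncorrect / ncovers
print(coverage, accuracy)                  # 0.5 0.5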
There are many sequential covering algorithms. Popular variations include AQ, CN2, and the more recent, RIPPER.
The general strategy is as follows:
• Rules are learned one at a time.
• Each time a rule is learned, the tuples covered by the rule are removed.
• The process repeats on the remaining tuples.
This sequential learning of rules is in contrast to decision tree induction. Because the path to each leaf in a decision
tree corresponds to a rule, we can consider decision tree induction as learning a set of rules simultaneously.
A basic sequential covering algorithm is shown in following figure. Here, rules are learned for one class at a time.
Ideally, when learning a rule for a class, Ci, we would like the rule to cover all (or many) of the training tuples of
class C and none (or few) of the tuples from other classes. In this way, the rules learned should be of high accuracy.
The rules need not necessarily be of high coverage. This is because we can have more than one rule for a class, so
that different rules may cover different tuples within the same class.
Input:
D, a data set of class-labelled tuples;
Att_vals, the set of all attributes and their possible values.
Method:
Rule_set = { }; // the set of rules learned is initially empty
for each class c do
    repeat
        Rule = Learn_One_Rule(D, Att_vals, c);
        remove the tuples covered by Rule from D;
        add Rule to Rule_set;
    until terminating condition;
endfor
return Rule_set;
The process continues until the terminating condition is met, such as when there are no more training tuples or the
quality of a rule returned is below a user-specified threshold. The Learn One Rule procedure finds the “best” rule
for the current class, given the current set of training tuples.
Typically, rules are grown in a general-to-specific manner. We can think of this as a beam search, where we start off
with an empty rule and then gradually keep appending attribute tests to it. We append by adding the attribute test
as a logical conjunct to the existing condition of the rule antecedent. Suppose our training set, D, consists of loan
application data. Attributes regarding each applicant include their age, income, education level, residence, credit
rating, and the term of the loan. The classifying attribute is loan decision, which indicates whether a loan is accepted
(considered safe) or rejected (considered risky). To learn a rule for the class “accept,” we start off with the most
general rule possible, that is, the condition of the rule antecedent is empty. The rule is:
IF THEN loan_decision = accept
The accompanying figure shows this empty rule being grown in a general-to-specific manner, by appending attribute
tests such as income = high, loan_term = short, loan_term = medium, loan_term = long, and then further conjuncts
such as age = youth, age = middle_aged, credit_rating = excellent, and credit_rating = fair.
We then consider each possible attribute test that may be added to the rule. These can be derived from the parameter
Att_vals, which contains a list of attributes with their associated values. For example, for an attribute-value pair
(att, val), we can consider attribute tests such as att = val, att ≤ val, att > val, and so on. Typically, the training data
will contain many attributes, each of which may have several possible values. Finding an optimal rule set becomes
computationally explosive.
Instead, Learn One Rule adopts a greedy depth-first strategy. Each time it is faced with adding a new attribute test
(conjunct) to the current rule, it picks the one that most improves the rule quality, based on the training samples.
We will say more about rule quality measures in a minute. For the moment, let’s say we use rule accuracy as our
quality measure. Getting back to our example with the above figure, suppose Learn One Rule finds that the attribute
test income = high best improves the accuracy of our current (empty) rule. We append it to the condition, so that
the current rule becomes
IF income = high THEN loan_decision = accept.
Each time we add an attribute test to a rule, the resulting rule should cover more of the “accept” tuples. During
the next iteration, we again consider the possible attribute tests and end up selecting credit rating = excellent. Our
current rule grows to become
IF income = high AND credit rating = excellent THEN loan decision = accept.
The process repeats, where at each step, we continue to greedily grow rules until the resulting rule meets an acceptable
quality level. Greedy search does not allow for backtracking. At each step, we heuristically add what appears to be
the best choice at the moment. What if we unknowingly made a poor choice along the way? In order to lessen the
chance of this happening, instead of selecting the best attribute test to append to the current rule, we can select the
best k attribute tests. In this way, we perform a beam search of width k wherein we maintain the k best candidates
overall at each step, rather than a single best candidate.
Classification by backpropagation
Backpropagation is a neural network learning algorithm. The field of neural networks was originally kindled by
psychologists and neurobiologists who sought to develop and test computational analogues of neurons. Roughly
speaking, a neural network is a set of connected input/output units in which each connection has a weight associated
with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the
correct class label of the input tuples. Neural network learning is also referred to as connectionist learning due to
the connections between units. Neural networks involve long training times and are therefore more suitable for
applications where this is feasible. They require a number of parameters that are typically best determined empirically,
such as the network topology or “structure.” Neural networks have been criticised for their poor interpretability.
For example, it is difficult for humans to interpret the symbolic meaning behind the learned weights and of “hidden
units” in the network. These features initially made neural networks less desirable for data mining.
Each layer is made up of units. The inputs to the network correspond to the attributes measured for each training
tuple. The inputs are fed simultaneously into the units making up the input layer. These inputs pass through the input
layer and are then weighted and fed simultaneously to a second layer of “neuronlike” units, known as a hidden layer.
The outputs of the hidden layer units can be input to another hidden layer, and so on. The number of hidden layers
is arbitrary, although in practice, usually only one is used. The weighted outputs of the last hidden layer are input
to units making up the output layer, which emits the network’s prediction for given tuples. The units in the input
layer are called input units. The units in the hidden layers and output layer are sometimes referred to as neurodes,
due to their symbolic biological basis, or as output units. The multilayer neural network shown in the figure has
two layers of output units.
The figure shows a multilayer feed-forward network: inputs x1, x2, ..., xn are connected through weights wij to a
hidden layer producing outputs Oj, which in turn are connected through weights wjk to the output layer producing
outputs Ok.
Therefore, we say that it is a two-layer neural network. (The input layer is not counted because it serves only to
pass the input values to the next layer.) Similarly, a network containing two hidden layers is called a three-layer
neural network, and so on. The network is feed-forward in that none of the weights cycles back to an input unit or
to an output unit of a previous layer. It is fully connected in that each unit provides input to each unit in the next
forward layer.
Normalising the input values for each attribute measured in the training tuples will help speed up the learning phase.
Typically, input values are normalised so as to fall between 0.0 and 1.0. Discrete-valued attributes may be encoded
such that there is one input unit per domain value. For example, if an attribute A has three possible or known values,
namely {a0, a1, a2}, then we may assign three input units to represent A. That is, we may have, say, I0, I1, I2 as input
units. Each unit is initialised to 0. If A=a0, then I0 is set to 1. If A = a1, I1 is set to 1, and so on. Neural networks can be
used for either classification (to predict the class label of a given tuple) or prediction (to predict a continuous-valued
output). For classification, one output unit may be used to represent two classes (where the value 1 represents one
class and the value 0 represents the other). If there are more than two classes, then one output unit per class is used.
There are no clear rules as to the “best” number of hidden layer units. Network design is a trial-and-error process
and may affect the accuracy of the resulting trained network.
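The following minimal Python sketch shows the two preparations described above: min-max normalisation of a continuous attribute and one input column per domain value of a discrete attribute A with values {a0, a1, a2}. pandas is assumed, and the values are purely illustrative.

# Prepare network inputs: normalise continuous attributes, encode discrete ones.
import pandas as pd

tuples = pd.DataFrame({"income": [20_000, 55_000, 90_000], "A": ["a0", "a2", "a1"]})

# Normalise income into the range 0.0 to 1.0.
inc = tuples["income"]
tuples["income_norm"] = (inc - inc.min()) / (inc.max() - inc.min())

# One input unit per domain value: each I column is 1 only when A takes that value.
encoded = pd.get_dummies(tuples["A"], prefix="I").astype(int)
print(pd.concat([tuples, encoded], axis=1))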
The initial values of the weights may also affect the resulting accuracy. Once a network has been trained and its
accuracy is not considered acceptable, it is common to repeat the training process with a different network topology
or a different set of initial weights. Cross-validation techniques for accuracy estimation can be used to help to decide
when an acceptable network has been found. A number of automated techniques have been proposed that search
for a “good” network structure. These typically use a hill-climbing approach that starts with an initial structure that
is selectively modified.
Backpropagation
Backpropagation learns by iteratively processing a data set of training tuples, comparing the network’s prediction
for each tuple with the actual known target value. The target value may be the known class label of the training
tuple (for classification problems) or a continuous value (for prediction). For each training tuple, the weights are
modified so as to minimise the mean squared error between the network’s prediction and the actual target value.
These modifications are made in the “backwards” direction, that is, from the output layer, through each hidden
layer down to the first hidden layer (hence the name backpropagation). Although it is not guaranteed, in general
the weights will eventually converge, and the learning process stops. The algorithm is summarised in the following
figure. The steps involved are expressed in terms of inputs, outputs, and errors, and may seem awkward if this is
your first look at neural network learning. However, once you become familiar with the process, you will see that
each step is inherently simple. The steps are described below.
• Algorithm: Backpropagation. Neural network learning for classification or prediction using the backpropagation algorithm.
• Input:
D, a data set consisting of the training tuples and their associated target values;
l, the learning rate;
network, a multilayer feed-forward network.
• Output: A trained neural network.
• Method:
8. Ij = Σi wij Oi + θj; // compute the net input of unit j with respect to the previous layer, i
12. for each unit j in the hidden layers, from the last to the first hidden layer
13. Errj = Oj(1 − Oj) Σk Errk wjk; // compute the error with respect to the next higher layer, k
21. } }
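For readers who prefer running code, the following is a minimal sketch of backpropagation for a network with one hidden layer, written against the steps above (sigmoid units, updates scaled by the learning rate l). The tiny XOR-style data set, the network sizes and the number of epochs are assumptions made for the illustration, not values taken from the text.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Invented toy training data: two inputs, one binary target (XOR).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.5, size=(2, 3)), np.zeros(3)   # input -> hidden weights and biases
W2, b2 = rng.normal(scale=0.5, size=(3, 1)), np.zeros(1)   # hidden -> output weights and biases
l = 0.5                                                    # learning rate

for epoch in range(5000):
    for x, t in zip(X, T):
        # Propagate the inputs forward: net input, then sigmoid output (cf. step 8).
        O1 = sigmoid(x @ W1 + b1)
        O2 = sigmoid(O1 @ W2 + b2)
        # Backpropagate the errors, from the output layer back to the hidden layer (cf. steps 12-13).
        err2 = O2 * (1 - O2) * (t - O2)
        err1 = O1 * (1 - O1) * (W2 @ err2)
        # Update the weights and biases in the "backwards" direction.
        W2 += l * np.outer(O1, err2); b2 += l * err2
        W1 += l * np.outer(x, err1);  b1 += l * err1

print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))   # predictions approach [0, 1, 1, 0]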
3.8.2 Clustering
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A
cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the
objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered
as a form of data compression. Although classification is an effective means for distinguishing groups or classes
of objects, it requires the often costly collection and labelling of a large set of training tuples or patterns, which
the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: First partition
the set of data into groups based on data similarity (for example, using clustering), and then assign labels to the
relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable
to changes and helps single out useful features that distinguish different groups.
Cluster analysis has been widely used in numerous applications, including market research, pattern recognition,
data analysis, and image processing. In business, clustering can help marketers discover distinct groups in their
customer bases and characterise customer groups based on purchasing patterns. In biology, it can be used to derive
plant and animal taxonomies, categorise genes with similar functionality, and gain insight into structures inherent
in populations. Clustering may also help in the identification of areas of similar land use in an earth observation
database and in the identification of groups of houses in a city according to house type, value, and geographic
location, as well as the identification of groups of automobile insurance policy holders with a high average claim
cost. It can also be used to help to classify documents on the Web for information discovery. Clustering is also
called data segmentation in some applications because clustering partitions large data sets into groups according
to their similarity. Clustering can also be used for outlier detection, where outliers (values that are “far away” from
any cluster) may be more interesting than common cases.
Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities
in electronic commerce. For example, exceptional cases in credit card transactions, such as very expensive and
frequent purchases, may be of interest as possible fraudulent activity. As a data mining function, cluster analysis
can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristics of each
cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing
step for other algorithms, such as characterisation, attribute subset selection, and classification, which would then
operate on the detected clusters and the selected attributes or features.
Data clustering is under vigorous development. Contributing areas of research include data mining, statistics,
machine learning, spatial database technology, biology, and marketing. Owing to the huge amounts of data collected
in databases, cluster analysis has recently become a highly active topic in data mining research. As a branch of
statistics, cluster analysis has been extensively studied for many years, focusing mainly on distance-based cluster
analysis. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into
many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS. In machine learning,
clustering is an example of unsupervised learning. Unlike classification, clustering and unsupervised learning do not
rely on predefined classes and class-labelled training examples. For this reason, clustering is a form of learning by
observation, rather than learning by examples. In data mining, efforts have focused on finding methods for efficient
and effective cluster analysis in large databases. Active themes of research focus on the scalability of clustering
methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering
techniques, and methods for clustering mixed numerical and categorical data in large databases.
Clustering is a challenging field of research whose potential applications pose their own special requirements.
The following are typical requirements of clustering in data mining:
• Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but it also makes the quality of clustering difficult to control.
• Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.
• Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot incorporate newly inserted data (that is, database updates) into existing clustering structures and, instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data. That is, given a set of data objects, such an algorithm may return dramatically different clusterings depending on the order of presentation of the input objects. It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.
• High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.
• Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behaviour that satisfy specified constraints.
• Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.
where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is a nonnegative
number that is close to 0 when objects i and j are highly similar or “near” each other, and becomes larger the more
they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, the dissimilarity matrix is symmetric with zeros on its diagonal, so only its lower triangle needs to be stored.
The rows and columns of the data matrix represent different entities, while those of the dissimilarity matrix represent
the same entity. Thus, the data matrix is often called a two-mode matrix, whereas the dissimilarity matrix is called a
one-mode matrix. Many clustering algorithms operate on a dissimilarity matrix. If the data are presented in the form
of a data matrix, it can first be transformed into a dissimilarity matrix before applying such clustering algorithms.
Partitioning methods
To achieve global optimality in partitioning-based clustering, we would require the exhaustive enumeration of all of
the possible partitions. The heuristic clustering methods work well for finding spherical-shaped clusters in small to
medium-sized databases. To find clusters with complex shapes and for clustering very large data sets, partitioning-
based methods need to be extended. The most well-known and commonly used partitioning methods are k-means,
k-medoids, and their variations.
k-means algorithm
The k-means algorithm proceeds as follows:
First, it randomly selects k of the objects, each of which initially represents a cluster mean or centre. For each of
the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the distance
between the object and the cluster mean. It then computes the new mean for each cluster. This process iterates until
the criterion function converges. Typically, the square-error criterion is used, defined as
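In standard form, with k clusters C1, …, Ck and corresponding cluster means m1, …, mk, the criterion can be written as

E = Σ_{i=1}^{k} Σ_{p ∈ Ci} |p − mi|²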
Where, E is the sum of the square error for all objects in the data set; p is the point in space representing a given
object; and mi is the mean of cluster Ci (both p and mi are multidimensional). In other words, for each object in each
cluster, the distance from the object to its cluster centre is squared, and the distances are summed. This criterion
tries to make the resulting k clusters as compact and as separate as possible.
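A minimal sketch of this loop in Python with NumPy follows; the two-dimensional sample points and k = 2 are invented for the illustration (and the sketch assumes no cluster becomes empty):

import numpy as np

def k_means(points, k, iters=100, seed=0):
    # Randomly select k objects as the initial cluster centres.
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each object to the cluster whose mean is nearest.
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Compute the new mean for each cluster.
        new_centres = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):   # the criterion has converged
            break
        centres = new_centres
    return labels, centres

pts = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.3, 4.7]])
print(k_means(pts, k=2))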
k-medoids algorithm
Instead of taking the mean value of the objects in a cluster as a reference point, the k-medoids method picks an actual object (a representative object, or medoid) to represent each cluster. Each remaining object is clustered with the representative object to which it is the most similar. The partitioning
method is then performed based on the principle of minimising the sum of the dissimilarities between each object
and its corresponding reference point. That is, an absolute-error criterion is used, defined as
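In standard form, with clusters C1, …, Ck represented by objects o1, …, ok, this can be written as

E = Σ_{j=1}^{k} Σ_{p ∈ Cj} |p − oj|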
where E is the sum of the absolute error for all objects in the data set; p is the point in space representing a given
object in cluster Cj; and oj is the representative object of Cj. In general, the algorithm iterates until, eventually, each
representative object is actually the medoid, or most centrally located object, of its cluster. This is the basis of the
k-medoids method for grouping n objects into k clusters.
To decide whether a non-representative object, orandom, is a good replacement for a current representative object, oj, each of the remaining objects p is examined according to the following cases.
Case 1: p currently belongs to representative object, oj. If oj is replaced by orandom as a representative object and p is closest to one of the other representative objects, oi, i ≠ j, then p is reassigned to oi.
Case 2: p currently belongs to representative object, oj. If oj is replaced by orandom as a representative object and p is closest to orandom, then p is reassigned to orandom.
Case 3: p currently belongs to representative object, oi, i ≠ j. If oj is replaced by orandom as a representative object and p is still closest to oi, then the assignment does not change.
Case 4: p currently belongs to representative object, oi, i ≠ j. If oj is replaced by orandom as a representative object and p is closest to orandom, then p is reassigned to orandom.
Hierarchical methods
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method
can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed.
The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group.
It successively merges the objects or groups that are close to one another, until all of the groups are merged into one
(the topmost level of the hierarchy), or until a termination condition holds. The divisive approach, also called the
top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split
up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.
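A minimal sketch of the agglomerative (bottom-up) approach in Python, merging the two closest groups until the desired number remains; the toy points and the choice of single-linkage distance are assumptions made for the illustration:

import numpy as np

def agglomerative(points, target_clusters):
    # Start with each object forming a separate group.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > target_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: the distance between the closest pair of members.
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])   # successively merge the two closest groups
        del clusters[b]
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [9.0, 0.0]])
print(agglomerative(pts, target_clusters=2))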
Density-based methods
Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-
shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes. Other clustering methods have
been developed based on the notion of density. Their general idea is to continue growing the given cluster as long
as the density (number of objects or data points) in the “neighbourhood” exceeds some threshold; that is, for each
data point within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number of
points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.
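The density test just described can be sketched as follows (a DBSCAN-style check; the radius eps, the minimum count min_pts and the sample points are assumptions made for the illustration):

import numpy as np

def neighbourhood(points, i, eps):
    # Indices of all points within radius eps of point i (including i itself).
    dists = np.linalg.norm(points - points[i], axis=1)
    return np.flatnonzero(dists <= eps)

def is_dense(points, i, eps, min_pts):
    # A cluster keeps growing only around points whose neighbourhood is dense enough.
    return len(neighbourhood(points, i, eps)) >= min_pts

pts = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [4.0, 4.0]])
print([is_dense(pts, i, eps=0.5, min_pts=3) for i in range(len(pts))])
# The isolated point at (4, 4) fails the test and would be treated as noise.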
Grid-based methods
Grid-based methods quantise the object space into a finite number of cells that form a grid structure. All of the
clustering operations are performed on the grid structure (that is on the quantised space). The main benefit of this
approach is its fast processing time, which is typically independent of the number of data objects and dependent
only on the number of cells in each dimension in the quantised space. STING is a typical example of a grid-based
method.
[Figure: The hierarchical grid structure used by STING, with the 1st layer at the top and successively finer layers below it, down through the (i−1)-st layer to the ith layer.]
Model-based methods
Model-based methods hypothesise a model for each of the clusters and find the best fit of the data to the given model.
A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution
of the data points. It also leads to a way of automatically determining the number of clusters based on standard
statistics, taking “noise” or outliers into account and thus yielding robust clustering methods. EM is an algorithm
that performs expectation-maximisation analysis based on statistical modelling. COBWEB is a conceptual learning
algorithm that performs probability analysis and takes concepts as a model for clusters.
Expectation-Maximisation
The EM (Expectation-Maximisation) algorithm is a popular iterative refinement algorithm that can be used for
finding the parameter estimates. It can be viewed as an extension of the k-means paradigm, which assigns an object
to the cluster with which it is most similar, based on the cluster mean. The algorithm is described as follows:
• Make an initial guess of the parameter vector: This involves randomly selecting k objects to represent the cluster
means or centres (as in k-means partitioning), as well as making guesses for the additional parameters.
Each cluster can be represented by a probability distribution, centred at a mean, and with a standard deviation. Here, we have two clusters, corresponding to the Gaussian distributions g(m1, σ1) and g(m2, σ2), respectively, where the dashed circles represent the first standard deviation of the distributions.
• Iteratively refine the parameters (or clusters) based on the following two steps:
(a) Expectation Step: Assign each object xi to cluster Ck with the probability
P(xi ∈ Ck) = p(Ck | xi) = p(Ck) p(xi | Ck) / p(xi),
where p(xi | Ck) = N(mk, Ek(xi)) follows the normal (that is, Gaussian) distribution around mean mk, with expectation Ek. In other words, this step calculates the probability of cluster membership of object xi, for each of the clusters.
These probabilities are the “expected” cluster memberships for object xi.
(b) Maximisation Step: Use the probability estimates from above to re-estimate (or refine) the model parameters.
This step is the “maximisation” of the likelihood of the distributions given the data.
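A minimal sketch of these two steps for a one-dimensional mixture of two Gaussians follows; equal, fixed mixing weights and the synthetic data are simplifying assumptions made for the illustration:

import numpy as np

def em_two_gaussians(x, iters=50):
    m = np.array([x.min(), x.max()], dtype=float)   # initial guesses for the two means
    s = np.array([x.std(), x.std()]) + 1e-6         # initial guesses for the standard deviations
    for _ in range(iters):
        # Expectation step: probability of membership in each cluster (equal priors assumed).
        dens = np.exp(-(x[:, None] - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # Maximisation step: re-estimate the parameters from the probability-weighted data.
        m = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
        s = np.sqrt((resp * (x[:, None] - m) ** 2).sum(axis=0) / resp.sum(axis=0)) + 1e-6
    return m, s

data = np.concatenate([np.random.default_rng(1).normal(0, 1, 100),
                       np.random.default_rng(2).normal(6, 1, 100)])
print(em_two_gaussians(data))   # means near 0 and 6, standard deviations near 1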
3.8.3 Association Rules
The goal of the techniques described in this topic is to detect relationships or associations between specific values
of categorical variables in large data sets. This is a common task in many data mining projects as well as in the data
mining subcategory text mining. These powerful exploratory techniques have a wide range of applications in many
areas of business practice and research too.
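As a small illustration of the support and confidence measures used in the analysis below, the following sketch computes them for a single candidate rule over a handful of invented transactions:

def support_confidence(transactions, antecedent, consequent):
    # Support: fraction of transactions containing both sides of the rule.
    # Confidence: among transactions containing the antecedent, the fraction also containing the consequent.
    antecedent, consequent = set(antecedent), set(consequent)
    n_both = sum(1 for t in transactions if antecedent | consequent <= set(t))
    n_ante = sum(1 for t in transactions if antecedent <= set(t))
    support = n_both / len(transactions)
    confidence = n_both / n_ante if n_ante else 0.0
    return support * 100, confidence * 100   # expressed in percent, as in the text

baskets = [{"bread", "butter"}, {"bread", "milk"}, {"bread", "butter", "milk"}, {"milk"}]
print(support_confidence(baskets, {"bread"}, {"butter"}))   # about (50.0, 66.7) for bread -> butter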
This results spreadsheet shows an example of how association rules can be applied to text mining tasks. This analysis
was performed on the paragraphs (dialog spoken by the characters in the play) in the first scene of Shakespeare’s
“All’s Well That Ends Well,” after removing a few very frequent words like is, of, and so on. The values for support,
confidence, and correlation are expressed in percent.
Association Rules Networks, 3D: Association rules can be graphically summarised in 2D Association Networks,
as well as 3D Association Networks. Shown below are some (very clear) results from an analysis. Respondents in
a survey were asked to list their (up to) 3 favourite fast-foods. The association rules derived from those data are
summarised in a 3D Association Network display.
Fig. 3.16 Association Rules Networks, 3D
(Source: http://www.statsoft.com/textbook/association-rules/)
Summary
• Data mining refers to the process of finding interesting patterns in data that are not explicitly part of the data.
• Data mining techniques can be applied to a wide range of data repositories including databases, data warehouses,
spatial data, multimedia data, Internet or Web-based data and complex objects.
• The Knowledge Discovery in Databases process consists of a few steps leading from raw data collections to
some form of new knowledge.
• Data selection and data transformation can also be combined where the consolidation of the data is the result of
the selection, or, as for the case of data warehouses, the selection is done on transformed data.
• Data mining derives its name from the similarities between searching for valuable information in a large database
and mining rocks for a vein of valuable ore.
• The concept of bagging applies to the area of predictive data mining, to combine the predicted classifications
from multiple models, or from the same type of model for different learning data.
• The Cross-Industry Standard Process for Data Mining (CRISP–DM) was developed in 1996 by analysts representing DaimlerChrysler, SPSS, and NCR. CRISP–DM provides a non-proprietary and freely available standard process for fitting data mining into the general problem-solving strategy of a business or research unit.
• With the increasing demand on the analysis of large amounts of structured data, graph mining has become an
active and important theme in data mining.
• The graph is typically very large, with nodes corresponding to objects and edges corresponding to links
representing relationships or interactions between objects. Both nodes and links have attributes.
• Social networks are rarely static. Their graph representations evolve as nodes and edges are added or deleted
over time.
• Multirelational data mining (MRDM) methods search for patterns that involve multiple tables (relations) from
a relational database.
• The data mining algorithm is the mechanism that creates mining models.
• Classification is the task of generalising known structure to apply to new data.
• A cluster of data objects can be treated collectively as one group and so may be considered as a form of data
compression.
• The goal of the techniques described in this topic is to detect relationships or associations between specific
values of categorical variables in large data sets. This is a common task in many data mining projects as well
as in the data mining subcategory text mining.
References
• Seifert, J. W., 2004. Data Mining: An Overview [Online PDF] Available at: <http://www.fas.org/irp/crs/RL31798.pdf>. [Accessed 9 September 2011].
• Alexander, D., Data Mining [Online] Available at: <http://www.laits.utexas.edu/~norman/BUS.FOR/course.
mat/Alex/>. [Accessed 9 September 2011].
• Han, J., Kamber, M. and Pei, J., 2011. Data Mining: Concepts and Techniques, 3rd ed., Elsevier.
• Adriaans, P., 1996. Data Mining, Pearson Education India.
• StatSoft, 2010. Data Mining, Cluster Techniques - Session 28 [Video Online] Available at: < http://www.youtube.
com/watch?v=WvR_0Vs1U8w>. [Accessed 12 September 2011].
• Swallacebithead, 2010. Using Data Mining Techniques to Improve Forecasting [Video Online] Available at:
<http://www.youtube.com/watch?v=UYkf3i6LT3Q>. [Accessed 12 September 2011].
Recommended Reading
• Chattamvelli, R., 2011. Data Mining Algorithms, Alpha Science International Ltd.
• Thuraisingham, B. M., 1999. Data mining: technologies, techniques, tools, and trends, CRC Press.
• Witten, I. H. and Frank, E., 2005. Data mining: practical machine learning tools and techniques, 2nd ed.,
Morgan Kaufmann
Self Assessment
1. Which of the following refers to the process of finding interesting patterns in data that are not explicitly part
of the data?
a. Data mining
b. Data warehousing
c. Data extraction
d. Metadata
3. Which of the following is used to address the inherent instability of results while applying complex models to
relatively small data sets?
a. Boosting
b. Bagging
c. Data reduction
d. Data preparation
4. The concept of ________ in predictive data mining refers to the application of a model for prediction or
classification to new data.
a. bagging
b. boosting
c. drill-down analysis
d. deployment
7. The Cross-Industry Standard Process for Data Mining (CRISP–DM) was developed in _______.
a. 1995
b. 1996
c. 1997
d. 1998
9. __________ algorithm is a popular iterative refinement algorithm that can be used for finding the parameter
estimates.
a. Model-based methods
b. Wavecluster
c. Expectation-Maximisation
d. Grid-based method
10. _________ is a grid-based multiresolution clustering technique in which the spatial area is divided into rectangular
cells.
a. STING
b. KDD
c. DBSCAN
d. OPTICS
Chapter IV
Web Application of Data Mining
Aim
The aim of this chapter is to:
Objectives
The objectives of this chapter are to:
Learning outcome
At the end of this chapter, you will be able to:
4.1 Introduction
Knowledge Discovery in Databases, frequently abbreviated as KDD, typically encompasses more than data mining.
The knowledge discovery process comprises six phases, such as data selection, data cleansing, enrichment, data
transformation or encoding, data mining, and the reporting and display of the discovered information.
As an example, consider a transaction database maintained by a specialty consumer goods retailer. Suppose the
client data includes a customer name, zip code, phone number, date of purchase, item code, price, quantity, and
total amount. A variety of new knowledge can be discovered by KDD processing on this client database. During
data selection, data about specific items or categories of items, or from stores in a specific region or area of the
country, may be selected. The data cleansing process may correct invalid zip codes or eliminate records with incorrect phone prefixes. Enrichment typically enhances the data with additional sources of information. For example, given
the client names and phone numbers, the store may purchase other data about age, income, and credit rating and
append them to each record.
Data transformation and encoding may be done to reduce the amount of data. For instance, item codes may be
grouped in terms of product categories into audio, video, supplies, electronic gadgets, camera, accessories, and so
on. Zip codes may be aggregated into geographic regions; incomes may be divided into ranges, and so on. If data
mining is based on an existing warehouse for this retail store chain, we would expect that the cleaning has already
been applied. It is only after such preprocessing that data mining techniques are used to mine different rules and
patterns.
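A minimal sketch of this kind of transformation and encoding follows; the item-category map and income break-points are invented for the illustration:

# Hypothetical category map and income break-points, invented for the illustration.
CATEGORY = {"I-1001": "audio", "I-2001": "video", "I-3001": "camera"}
INCOME_RANGES = [(0, 25000, "low"), (25000, 75000, "medium"), (75000, float("inf"), "high")]

def encode(record):
    # Replace the raw item code and income with coarser, mining-friendly codes.
    item_category = CATEGORY.get(record["item_code"], "other")
    income_band = next(label for lo, hi, label in INCOME_RANGES
                       if lo <= record["income"] < hi)
    return {**record, "item_category": item_category, "income_band": income_band}

print(encode({"item_code": "I-2001", "income": 43000, "zip": "302017"}))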
We can see that many possibilities exist for discovering new knowledge about buying patterns, relating factors such
as age, income group, place of residence, to what and how much the customers purchase. This information can then
be utilised to plan additional store locations based on demographics, to run store promotions, to combine items in
advertisements, or to plan seasonal marketing strategies. As this retail store example shows, data mining must be
preceded by significant data preparation before it can yield useful information that can directly influence business
decisions. The results of data mining may be reported in a variety of formats, such as listings, graphic outputs,
summary tables, or visualisations.
The term data mining is popularly used in a very broad sense. In some situations it includes statistical analysis and constrained optimisation as well as machine learning. There is no sharp line separating data mining from these disciplines. It is therefore beyond our scope to discuss in detail the entire range of applications that make up this vast body of work. For a detailed understanding of the area, readers are referred to specialised books devoted to data mining.
Knowledge is often classified as inductive versus deductive. Deductive knowledge deduces new information based
on applying pre-specified logical rules of deduction on the given data. Data mining addresses inductive knowledge,
which discovers new rules and patterns from the supplied data. Knowledge can be represented in many forms: In an
unstructured sense, it can be represented by rules or propositional logic. In a structured form, it may be represented
in decision trees, semantic networks, neural networks, or hierarchies of classes or frames. It is common to describe
the knowledge discovered during data mining in five ways, as follows:
• Association rules-These rules correlate the presence of a set of items with another range of values for another
set of variables. Examples: (1) When a female retail shopper buys a handbag, she is likely to buy shoes. (2) An
X-ray image containing characteristics a and b is likely to also exhibit characteristic c.
• Classification hierarchies-The goal is to work from an existing set of events or transactions to create a hierarchy
of classes. Examples: (1) A population may be divided into five ranges of credit worthiness based on a history
of previous credit transactions. (2) A model may be developed for the factors that determine the desirability
of location of a store on a 1-10 scale. (3) Mutual funds may be classified based on performance data using
characteristics such as growth, income, and stability.
• Sequential patterns-A sequence of actions or events is sought. Example: If a patient underwent cardiac bypass
surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery,
he or she is likely to suffer from kidney failure within the next 18 months. Detection of sequential patterns is
equivalent to detecting associations among events with certain temporal relationships.
• Patterns within time series-Similarities can be detected within positions of a time series of data, which is a
sequence of data taken at regular intervals such as daily sales or daily closing stock prices. Examples: (1) Stocks
of a utility company, ABC Power, and a financial company, XYZ Securities, showed the same pattern during
2002 in terms of closing stock price. (2) Two products show the same selling pattern in summer but a different
pattern in winter. (3) A pattern in solar magnetic wind may be used to predict changes in earth atmospheric
conditions.
• Clustering-A given population of events or items can be partitioned (segmented) into sets of “similar” elements.
Examples: (1) An entire population of treatment data on a disease may be divided into groups based on the
similarity of side effects produced. (2) The adult population in the United States may be categorised into five
groups from “most likely to buy” to “least likely to buy” a new product. (3) The web accesses made by a
collection of users against a set of documents (say, in a digital library) may be analysed in terms of the keywords
of documents to reveal clusters or categories of users.
For most applications, the desired knowledge is a combination of the above discussed types.
scheduling. For most project management specialists, KDP and DM are not familiar terms. Therefore, these
specialists need a definition of what such projects involve and how to carry them out in order to develop a
sound project schedule.
• Knowledge discovery should follow the example of other engineering disciplines that already have established
models. A good example is the software engineering field, which is a relatively new and dynamic discipline
that exhibits many characteristics that are pertinent to knowledge discovery. Software engineering has adopted
several development models, including the waterfall and spiral models that have become well-known standards
in this area.
• There is a widely recognised need for standardisation of the KDP. The challenge for modern data miners is to come up with widely accepted standards that will stimulate major industry growth. Standardisation of the KDP model would allow the development of standardised methods and procedures, thereby enabling end users to deploy their projects more easily. It would lead directly to project performance that is faster, cheaper, more reliable, and more manageable. The standards would promote the development and delivery of solutions that use business terminology rather than the traditional language of algorithms, matrices, criteria, complexities, and the like, resulting in greater exposure and acceptability for the knowledge discovery field.
As there is some confusion about the terms data mining, knowledge discovery, and knowledge discovery in databases, we first need to define them. Note, however, that many researchers and practitioners use DM as a synonym for knowledge discovery, even though DM is, strictly speaking, just one step of the KDP.
The knowledge discovery process (KDP), also called knowledge discovery in databases, seeks new knowledge in
some application domain. It is defined as the nontrivial process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data.
The process generalises to non-database sources of data, although it emphasizes databases as a primary source of
data. It consists of many steps (one of them is DM), each attempting to complete a particular discovery task and
each accomplished by the application of a discovery method. Knowledge discovery concerns the entire knowledge
extraction process, including how data are stored and accessed, how to use efficient and scalable algorithms to
analyze massive datasets, how to interpret and visualize the results, and how to model and support the interaction
between human and machine. It also concerns support for learning and analyzing the application domain.
Since the 1990s, various different KDPs have been developed. The initial efforts were led by academic research
but were quickly followed by industry. The first basic structure of the model was proposed by Fayyad et al. and
later improved/modified by others. The process consists of multiple steps that are executed in a sequence. Each
subsequent step is initiated upon successful completion of the previous step, and requires the result generated by
the previous step as its input. Another common feature of the proposed models is the range of activities covered,
which stretches from the task of understanding the project domain and data, through data preparation and analysis,
to evaluation, understanding, and application of the generated results.
All the proposed models also emphasise the iterative nature of the model, in terms of many feedback loops that are
triggered by a revision process. A schematic diagram is shown in the above figure. The main differences between the
models described here are found in the number and scope of their specific steps. A common feature of all models is
the definition of inputs and outputs. Typical inputs include data in various formats, such as numerical and nominal
data stored in databases or flat files; images; video; semi-structured data, such as XML or HTML; and so on. The
output is the generated new knowledge — usually described in terms of rules, patterns, classification models,
associations, trends, statistical analysis and so on.
[Figure 4.3: Estimates of the relative effort (%) spent on each of the KDDM steps — understanding of the domain, understanding of the data, preparation of the data, data mining, evaluation of the results, and deployment of the results — with the vertical axis running from 0 to 50 per cent.]
Most models follow a similar sequence of steps, while the common steps between the five are domain understanding,
data mining, and evaluation of the discovered knowledge. The nine-step model carries out the steps concerning
the choice of DM tasks and algorithms late in the process. The other models do so before preprocessing of the
data in order to obtain data that are correctly prepared for the DM step without having to repeat some of the earlier
steps. In the case of Fayyad’s model, the prepared data may not be suitable for the tool of choice, and thus a loop
back to the second, third, or fourth step may be needed. The five-step model is very similar to the six-step models,
except that it omits the data understanding step. The eight-step model gives a very detailed breakdown of steps in
the initial phases of the KDP, but it does not allow for a step concerned with applying the discovered knowledge.
Simultaneously, it recognizes the important issue of human resource identification.
A very important aspect of the KDP is the relative time spent in completing each of the steps. Evaluation of this
effort allows precise scheduling. Various estimates have been proposed by researchers and practitioners alike. Figure
4.3 shows a comparison of these different estimates. We note that the numbers given are only estimates, which are
used to quantify relative effort, and their sum may not equal 100%. The specific estimated values depend on many
factors, such as existing knowledge about the considered project domain, the skill level of human resources, and the
complexity of the problem at hand, to name just a few. The common theme of all estimates is an acknowledgment
that the data preparation step is by far the most time-consuming part of the KDP.
• Validate the models: Test the model to make sure that it is producing accurate and adequate results.
• Monitor the model: Monitoring a model is essential because, with the passage of time, it will be necessary to revalidate the model to make sure that it is still meeting requirements. A model that works today may not work tomorrow; therefore, it is necessary to monitor the behaviour of the model to ensure it is meeting performance standards.
Web Mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web. There are roughly three knowledge discovery domains that pertain to web mining: Web Content Mining, Web Structure Mining, and Web Usage Mining. Web content mining is the process of extracting knowledge from the content of documents or their descriptions. Web document text mining and resource discovery based on concept indexing or agent-based technology may also fall in this category. Web structure mining is the process of inferring knowledge from the World Wide Web organisation and links between references and referents in the Web. Finally, web usage mining, also known as Web Log Mining, is the process of extracting interesting patterns from web access logs.
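As a small illustration of web usage (log) mining, the sketch below parses a few access-log lines and counts page requests and visitor clicks; the log lines and their simplified format are invented for the example:

from collections import Counter

LOG_LINES = [
    '10.0.0.1 - - [12/Sep/2011:10:01:02] "GET /products.html HTTP/1.1" 200',
    '10.0.0.1 - - [12/Sep/2011:10:01:40] "GET /cart.html HTTP/1.1" 200',
    '10.0.0.2 - - [12/Sep/2011:10:02:11] "GET /products.html HTTP/1.1" 200',
]

def page_counts(lines):
    # Count how often each page is requested and how many clicks each visitor (IP) makes.
    pages, visitors = Counter(), Counter()
    for line in lines:
        ip = line.split()[0]
        page = line.split('"')[1].split()[1]   # the path inside the quoted request
        pages[page] += 1
        visitors[ip] += 1
    return pages, visitors

print(page_counts(LOG_LINES))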
[Figure: A two-stage architecture for Web usage mining. Stage 1 collects documents and usage data (for example, name, address, age and occupation attributes) into a database; Stage 2 applies OLAP/visualisation tools, path analysis, sequential patterns, clustering and classification, attribute and rule query languages, and intelligent agents through a knowledge query mechanism.]
Web analysis tools provide companies with previously unknown statistics and helpful insights into the behaviour of their on-line customers. While the usage and popularity of such tools may continue to increase, many e-tailers are now demanding more useful information on their customers from the vast amounts of data generated by their web sites.
The result of the changing paradigm of commerce, from traditional brick and mortar shop fronts to electronic
transactions over the Internet, has been the dramatic shift in the relationship between e-tailers and their customers.
There is no longer any personal contact between retailers and customers. Customers are now highly mobile and
are demonstrating loyalty only to value, often irrespective of brand or geography. A major challenge for e-tailers
is to identify and understand their new customer base. E-tailers need to learn as much as possible regarding the
behaviour, the individual tastes and the preferences of the visitors to their sites in order to remain competitive in
this new era of commerce.
Text mining is directed toward the specific information supplied by the customer's search terms in search engines. This allows the entire Web to be scanned to retrieve the relevant cluster content, which in turn triggers the scanning of specific Web pages within those clusters. The resulting pages are relayed to the search engines ranked from the highest level of relevance to the lowest. Although search engines can provide links to thousands of Web pages related to the search content, this type of web mining reduces the amount of irrelevant information.
Web text mining is very useful when used in relation to a content database dealing with specific topics. For example, online universities use a library system to recall articles related to their general areas of study. This specific content database makes it possible to pull only the information within those subjects, providing the most specific results for search queries in search engines. Supplying only the most relevant information gives a higher quality of results. This productivity increase is due directly to the use of content mining of text and visuals.
The main uses of this type of data mining are to gather, categorise, organise and provide the best possible information
available on the WWW to the user requesting the information. This tool is imperative to scanning the many HTML
documents, images, and text provided on Web pages. The resulting information is provided to the search engines
in order of relevance giving more productive results of each search.
Web content categorisation with a content database is the most significant tool for the efficient use of search engines. A customer requesting information on a particular subject or item would otherwise have to search through thousands of results to find the information most relevant to his query. Text mining reduces those thousands of results, which eliminates frustration and improves the navigation of information on the Web.
Business uses of content mining allow the information provided on a company's site to be structured in a relevance-ordered site map. This enables a customer of the Web site to access specific information without having to search the entire site. With the use of this type of mining, data remain available in order of relevance to the query, thus supporting productive marketing. Used as a marketing tool, this drives additional traffic to the Web pages of a company's site based on the amount of keyword relevance the pages offer to general searches. As the second section of data mining, text mining is useful for enhancing the productive uses of mining for businesses, Web designers, and search engine operations. Organisation, categorisation, and gathering of the information provided by the WWW become easier and produce more productive results through the use of this type of mining.
In short, the ability to conduct Web content mining enables search engine results to direct the flow of customer clicks to a Web site, or to particular Web pages of the site, in proportion to their relevance to search queries. The clustering and organisation of Web content in a content database enables effective navigation of the pages by the customer and by search engines. Images, content, formats and Web structure are examined to produce higher-quality information for the user based upon the requests made. Businesses can also use this text mining to enhance the marketing of their sites as well as the products they offer.
Structure mining addresses two main problems of the World Wide Web that stem from its vast amount of information. The first of these problems is irrelevant search results: the relevance of search information becomes misconstrued because search engines often allow only low-precision criteria. The second problem is the inability to index the vast amount of information provided on the Web, which causes a low amount of recall with content mining. Structure mining helps to minimise these problems, in part by discovering the model underlying the Web hyperlink structure.
The main purpose of structure mining is to extract previously unknown relationships between Web pages. This kind of structure mining allows a business to link the information of its own Web site so as to enable navigation, and to cluster information into site maps. This allows its users to access the desired information through keyword association and content mining. Hyperlink hierarchy is also determined to relate the information within the site to competitor links and connections through search engines and third-party co-links. This enables clustering of connected Web pages in order to establish the relationship between these pages.
On the WWW, the use of structure mining enables the determination of similar structures of Web pages by clustering through the identification of the underlying structure. This information can be used to project the similarities of web content. The known similarities then make it possible to maintain or improve the information of a site so that web spiders access it at a higher rate. The larger the number of Web crawlers visiting the site, the more the site benefits from content related to searches.
In the business world, structure mining can be quite useful in determining the connection between two or more business Web sites. The determined connection provides an effective tool for mapping competing companies through third-party links such as resellers and customers. This cluster map allows the content of the business pages to be placed in the search engine results through the connection of keywords and co-links throughout the relationship of the Web pages. This information provides the proper path, through structure mining, to enhance navigation of these pages through their relationships and the link hierarchy of the Web sites.
With improved navigation of Web pages on business Web sites, connecting the requested information to a search engine becomes more effective. This stronger connection helps to generate traffic to a business site and to provide results that are more productive. The more links provided within the relationship of the web pages, the easier the navigation of the link hierarchy becomes. This improved navigation attracts the spiders to the correct locations providing the requested information, which proves more beneficial in clicks to a particular site.
Therefore, Web mining and the use of structure mining can provide strategic results for marketing a Web site for the production of sales. The more traffic directed to the Web pages of a particular site, the higher the level of return visitation to the site and recall by search engines relating to the information or product provided by the company. This also enables marketing strategies to produce results that are more productive through navigation of the pages linking to the homepage of the site itself. Structure mining is therefore essential in order to truly utilise a website as a business tool on the Web.
Usage mining allows companies to produce productive information pertaining to the future of their business functionality. Some of this information can be derived from the collective information on lifetime user value, product cross-marketing strategies and promotional campaign effectiveness. The usage data that are gathered provide companies with the ability to produce results more effective for their businesses and to increase sales. Usage data can also be effective for developing marketing skills that will out-sell competitors and promote the company's services or products at a higher level.
Usage mining is valuable not only to businesses using online marketing, but also to e-businesses whose business is based solely on the traffic provided through search engines. This type of web mining helps to gather the important information from customers visiting the site. This enables an in-depth log for complete analysis of a company's productivity flow. E-businesses depend on this information to direct the company to the most effective Web server for the promotion of their product or service.
This web mining also enables Web based businesses to provide the best access routes to services or other
advertisements. When a company advertises for services provided by other companies, the usage mining data allows
for the most effective access paths to these portals. In addition, there are typically three main uses for mining in
this fashion.
The first is usage processing, used to complete pattern discovery. This use is also the most difficult because only bits of information like IP addresses, user information, and site clicks are available. With this minimal amount of information available, it is harder to track the user through a site, since the log does not follow the user throughout the pages of the site.
The second use is content processing, consisting of the conversion of Web information like text, images, scripts
and others into useful forms. This helps with the clustering and categorization of Web page information based on
the titles, specific content and images available.
Finally, the third use is structure processing. This consists of analysis of the structure of each page contained in a
Web site. This structure process can prove to be difficult if resulting in a new structure having to be performed for
each page.
Analysis of these usage data will provide companies with the information needed to present an effective presence to their customers. This collection of information may include user registration details, access logs and information leading to better Web site structure, proving to be most valuable for a company's online marketing. These represent some of the advantages for external marketing of the company's products, services and overall management.
Internally, usage mining effectively provides information for improving communication over a company intranet. Developing strategies through this type of mining allows intranet-based company databases to be more effective through the provision of easier access paths. The projection of these paths helps to log user registration information, bringing commonly used paths to the forefront for access.
Therefore, it is easily determined that usage mining has valuable uses in the marketing of businesses and a direct impact on the success of their promotional strategies and internet traffic. This information is gathered regularly and continues to be analysed consistently. Analysis of this pertinent information will help companies to develop more useful promotions, internet accessibility, inter-company communication and structure, and productive marketing skills through web usage mining.
Summary
• The knowledge discovery process comprises six phases, such as data selection, data cleansing, enrichment, data
transformation or encoding, data mining, and the reporting and display of the discovered information.
• Data mining is typically carried out with some end goals or applications. Broadly speaking, these goals fall into
the following classes: prediction, identification, classification, and optimisation.
• The term “knowledge” is very broadly interpreted as involving some degree of intelligence.
• Deductive knowledge deduces new information based on applying pre-specified logical rules of deduction on
the given data.
• The knowledge discovery process defines a sequence of steps (with eventual feedback loops) that should be
followed to discover knowledge in data.
• The KDP model consists of a set of processing steps to be followed by practitioners when executing a knowledge
discovery project.
• Web Mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts
or activity related to the WorldWide Web.
• Web analysis tools analyse and process these web server log files to produce meaningful information.
• Web analysis tools provide companies with previously unknown statistics and useful insights into the behaviour of their on-line customers.
• Content mining is the scanning and mining of text, pictures and graphs of a Web page to determine the relevance
of the content to the search query.
• Web structure mining, one of three categories of web mining for data, is a tool used to identify the relationship
between Web pages linked by information or direct link connection.
• Web usage mining allows for the collection of Web access information for Web pages.
References
• Zaptron, 1999. Introduction to Knowledge-based Knowledge Discovery [Online] Available at: <http://www.
zaptron.com/knowledge/>. [Accessed 9 September 2011].
• Maimom, O. and Rokach, L., Introduction To Knowledge Discovery In Database [Online PDF] Available at:
<http://www.ise.bgu.ac.il/faculty/liorr/hbchap1.pdf>. [Accessed 9 September 2011].
• Maimom, O. and Rokach, L., 2005. Data mining and knowledge discovery handbook, Springer Science and
Business.
• Liu, B., 2007. Web data mining: exploring hyperlinks, contents, and usage data, Springer.
• http://nptel.iitm.ac.in, 2008. Lecture - 34 Data Mining and Knowledge Discovery [Video Online] Available at: <http://www.youtube.com/watch?v=m5c27rQtD2E>. [Accessed 12 September 2011].
• http://nptel.iitm.ac.in, 2008. Lecture - 35 Data Mining and Knowledge Discovery Part II [Video Online] Available
at: <http://www.youtube.com/watch?v=0hnqxIsXcy4&feature=relmfu>. [Accessed 12 September 2011].
Recommended Reading
• Scime, A., 2005. Web mining: applications and techniques, Idea Group Inc (IGI).
• Han, J., Kamber, M. 2006. Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann.
• Chang, G., 2001. Mining the World Wide Web: an information search approach, Springer.
Self Assessment
1. During__________, data about specific items or categories of items, or from stores in a specific region or area
of the country, may be selected.
a. data selection
b. data extraction
c. data mining
d. data warehousing
3. Which of the following is defined as the nontrivial process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data?
a. Knowledge discovery process
b. Data mining
c. Data warehousing
d. Web mining
5. Which of the following requires a significant project management effort that needs to be grounded in a solid
framework?
a. Knowledge discovery process
b. Knowledge discovery in database
c. Knowledge discovery project
d. Knowledge discovery
6. __________ is the extraction of interesting and potentially useful patterns and implicit information from artifacts
or activity related to the WorldWide Web.
a. Data extraction
b. Data mining
c. Web mining
d. Knowledge discovery process
7. ____________ tools provide companies with previously unknown statistics and useful insights into the behaviour
of their on-line customers.
a. Web analysis
b. Web mining
c. Data mining
d. Web usage mining
8. What enables e-tailers to leverage their on-line customer data by understanding and predicting the behaviour
of their customers?
a. Web analysis
b. Web mining
c. Data mining
d. Web usage mining
10. What is used to identify the relationship between Web pages linked by information or direct link connection?
a. Web mining
b. Web usage mining
c. Web content mining
d. Web structure mining
Chapter V
Advanced Topics of Data Mining
Aim
The aim of this chapter is to:
Objectives
The objectives of this chapter are to:
Learning outcome
At the end of this chapter, you will be able to:
5.1 Introduction
The technical progress in computerised data acquisition and storage has resulted in the growth of vast databases. With this continuous increase and accumulation, the huge amounts of computerised data have far exceeded human ability to interpret and use them completely. These phenomena may be even more serious in geo-spatial science. In order to understand and make full use of these data repositories, a number of techniques have been tried, for instance, expert systems, database management systems, spatial data analysis, machine learning, and artificial intelligence. In 1989, knowledge discovery in databases was further proposed. Later, in 1995, data mining also appeared. As both data mining and knowledge discovery in databases virtually point to the same techniques, people like to refer to them together, that is, data mining and knowledge discovery (DMKD). As about 80% of data are geo-referenced, necessity forces people to consider spatial characteristics in DMKD and to further develop a branch in geo-spatial science, that is, SDMKD.
Spatial data are more complex, more changeable and larger than ordinary transactional datasets. The spatial dimension means that each item of data has a spatial reference, where each entity occurs on a continuous surface, or where a spatially referenced relationship exists between two neighbouring entities. Spatial data contain not only positional data and attribute data, but also spatial relationships among spatial entities. Moreover, spatial data structures are more complex than the tables in an ordinary relational database. Besides tabular data, there are vector and raster graphic data in a spatial database. Moreover, the features of graphic data are not explicitly stored in the database. At the same time, contemporary GIS have only basic analysis functionalities, the results of which are explicit. And it is under the assumption of dependency, and on the basis of the sampled data, that geostatistics estimates values at unsampled locations or makes a map of the attribute. Because the discovered spatial knowledge can support and improve spatial data-referenced decision-making, growing attention has been paid to the study, development and application of SDMKD.
5.2 Concepts
Spatial data mining and knowledge discovery (SDMKD) is an efficient extraction of hidden, implicit, interesting,
previously unknown, potentially useful, ultimately understandable, spatial or non-spatial knowledge (rules,
regularities, patterns, constraints) from incomplete, noisy, fuzzy, random and practical data in large spatial databases. It
is a confluence of databases technology, artificial intelligence, machine learning, probabilistic statistics, visualisation,
information science, pattern recognition and other disciplines. Understood from different viewpoints (Table 5.1),
SDMKD shows many new interdisciplinary characteristics.
Table 5.1 Spatial data mining and knowledge discovery in various viewpoints
5.2.1 Mechanism
SDMKD is a process of discovering a form of rules along with exceptions at hierarchical view-angles with various thresholds, for instance, drilling, dicing and pivoting on multidimensional databases, spatial data warehousing, generalising, characterising and classifying entities, summarising and contrasting data characteristics, describing rules, predicting future trends and so on. It is also a process that supports spatial decision-making. There are two mining granularities, namely spatial object granularity and pixel granularity.

It may be briefly partitioned into three big steps: data preparation (positioning the mining objective, collecting background knowledge, cleaning spatial data), data mining (decreasing data dimensions, selecting mining techniques, discovering knowledge), and knowledge application (interpretation, evaluation and application of the discovered knowledge). In order to discover reliable knowledge, it is common to use more than one technique to mine the data sets simultaneously. Moreover, it is also advisable to select the mining techniques on the basis of the given mining task and the knowledge to be discovered.
The forms of knowledge that may be discovered, with examples, include the following:
• Clustering rule: a segmentation rule that groups a set of objects together by virtue of their similarity or proximity to each other, in the unknown context of what groups and how many groups will be clustered. It organises data into unsupervised clusters based on attribute values. Example: group crime locations to find distribution patterns.
• Classification rule: a rule that defines whether a spatial entity belongs to a particular class or set, in the known context of what classes and how many classes will be classified. It organises data into given (supervised) classes based on attribute values. Example: classify remotely sensed images based on spectrum and GIS data.
• Serial rules: spatiotemporally constrained rules that relate spatial entities continuously in time, or describe the functional dependency among parameters. They analyse trends, deviations, regression, sequential patterns and similar sequences. Example: in summer, landslide disasters often happen; land price is a function of influential factors and time.
• Predictive rule: an inner trend that forecasts future values of some spatial variables when the temporal or spatial centre is moved to another one. It predicts some unknown or missing attribute values based on other seasonal or periodical information. Example: forecast the movement trend of a landslide based on available monitoring data.
• Exceptions: outliers that are isolated from the common rules or deviate very much from the other data observations. Example: a monitoring point with a much bigger movement.
Knowledge is a combination of rules and exceptions. A spatial rule is a pattern showing the intersection of two or more spatial objects or space-depending attributes according to a particular spacing or set of arrangements (Ester, 2000). Besides the rules, during the discovery process of description or prediction, there may be some exceptions (also named outliers) that deviate very much from the other data observations. They identify and explain exceptions (surprises). For example, spatial trend predictive modelling first discovers the centres that are local maxima of some non-spatial attribute, then determines the (theoretical) trend of that attribute when moving away from the centres. Finally, a few deviations are found where some data lie away from the theoretical trend. These deviations may arouse the suspicion that they are noise, or that they are generated by a different mechanism. How should these outliers be explained? Traditionally, outlier detection has been studied via statistics, and a number of discordance tests have been developed. Most of them treat outliers as “noise” and try to eliminate their effects by removing the outliers or by developing outlier-resistant methods. In fact, these outliers prove the rules. In the context of data mining, they are meaningful input signals rather than noise. In some cases, outliers represent unique characteristics of the objects that are important to an organisation. Therefore, a piece of generic knowledge is virtually in the form of rule plus exception.
The main techniques available for SDMKD are summarised below.
• Probability theory: mines spatial data with randomness on the basis of stochastic probabilities. The knowledge is represented as a conditional probability in the context of given conditions and a certain hypothesis being true. Also referred to as probability theory and mathematical statistics.
• Spatial statistics: discovers sequential geometric rules from disordered data via the covariance structure and variation function, in the context of adequate samples and background knowledge. Clustering analysis is a branch of it.
• Evidence theory: mines spatial data via belief functions and possibility functions. It is an extension of probability theory and is suitable for SDMKD based on stochastic uncertainty.
• Fuzzy sets: mine spatial data with fuzziness on the basis of a fuzzy membership function that depicts an inexact probability, by using fuzzy comprehensive evaluation, fuzzy clustering analysis, fuzzy control, fuzzy pattern recognition and so on.
• Rough sets: mine spatial data with incomplete uncertainties via a pair of lower and upper approximations. Rough sets-based SDMKD is also a process of intelligent decision-making under the umbrella of spatial data.
• Neural networks: mine spatial data via a nonlinear, self-learning, self-adaptive, parallel and dynamic system composed of many linked neurons in a network. The set of neurons collectively finds out rules by continuously learning from training samples in the network.
• Genetic algorithms: search for optimised rules from spatial data via three operators that simulate the replication, crossover and mutation of biological evolution.
• Decision trees: reason about rules by rolling down and drilling up a tree-structured map, of which the root node is the mining task, the item and branch nodes are the mining process, and the leaf nodes are exact data sets. After pruning, the hierarchical patterns are uncovered.
• Exploratory learning: focuses on data characteristics by analysing topological relationships, overlaying map layers, matching images, buffering features (points, lines, polygons) and optimising roads.
• Spatial inductive learning: comes from machine learning. It summarises and generalises spatial data in the context of a given background that comes from users or from a task of SDMKD. The algorithms require that the training data be composed of several tuples with various attributes, and that one of the attributes of each tuple is the class label.
• Visualisation: visually mines spatial data by computerised visualisation techniques that turn abstract data and complicated algorithms into concrete graphics, images, animation and so on, which the user can perceive directly.
• SOLAM: mines data via online analytical processing and a spatial data warehouse, based on a multidimensional view and the web. It is a tested form of mining that highlights execution efficiency and timely response to commands.
• Outlier detection: extracts the interesting exceptions, besides the common rules, from spatial data via statistics, clustering, classification and regression.
5.3.1 SDMKD-based Image Classification
This section presents an approach that combines spatial inductive learning with Bayesian image classification in a loosely coupled manner. It takes two learning granularities, pixel granularity and polygon granularity, as the mining granularities for learning the knowledge used to subdivide classes into subclasses, and it selects the class probability values of the Bayesian classification, shape features, locations and elevations as the learning attributes. GIS data are used for training-area selection for the Bayesian classification, for generating learning data of the two granularities, and for test-area selection for classification accuracy evaluation. The ground control points for image rectification are also chosen from the GIS data. Inductive learning in spatial data mining is implemented via the C5.0 algorithm on the basis of the learning granularities. Figure 5.1 shows the principle of the method.
Fig. 5.1 Flow diagram of remote sensing image classification with inductive learning
In Figure 5.1, the remote sensing images are first classified by the Bayesian method before using the knowledge, and the probabilities of each pixel belonging to every class are retained. Secondly, inductive learning is conducted on the learning attributes. Learning with probabilities simultaneously makes use of the spectral information of a pixel and the statistical information of a class, since the probability values are derived from both of them. Thirdly, knowledge about the attributes of general geometric features, spatial distribution patterns and spatial relationships is further discovered from the GIS database, for instance from the polygons of different classes. For example, the water areas in the classification image are converted from pixels to polygons by raster-to-vector conversion, and then the location and shape features of these polygons are calculated. Finally, the polygons are subdivided into subclasses by deductive reasoning based on the knowledge; for example, the class water is subdivided into subclasses such as river, lake, reservoir and pond.
The final classification results are obtained by post-processing the initial classification results with deductive reasoning. Except for the class label attribute, the attributes used for deductive reasoning are the same as those used in inductive learning. The knowledge discovered by the C5.0 algorithm is a group of classification rules and a default class, and each rule carries a confidence value between 0 and 1. According to how a rule is activated, that is, whether the attribute values match the conditions of the rule, the deductive reasoning adopts the following four strategies (illustrated in the sketch after the list):
• If only one rule is activated, then let the final class be the same as this rule.
• If several rules are activated, then let the final class be the same as the rule with the maximum confidence.
• If several rules are activated and the confidence values are the same, then let the final class be the same as the rule with the maximum coverage of learning samples.
• If no rule is activated, then let the final class be the default class.
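To make the decision logic concrete, here is a minimal sketch of the four strategies in Python. The Rule structure (a condition function, a class label, a confidence and a coverage count) and the example rules for subdividing water polygons are hypothetical illustrations, not the actual C5.0 output format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    condition: Callable[[Dict], bool]  # returns True if the attribute values match the rule
    label: str                         # class assigned by the rule
    confidence: float                  # confidence value between 0 and 1
    coverage: int                      # number of learning samples covered by the rule

def classify(attributes: Dict, rules: List[Rule], default_class: str) -> str:
    """Apply the four deductive-reasoning strategies to one polygon's attributes."""
    activated = [r for r in rules if r.condition(attributes)]
    if not activated:                        # strategy 4: no rule activated
        return default_class
    if len(activated) == 1:                  # strategy 1: exactly one rule activated
        return activated[0].label
    best_conf = max(r.confidence for r in activated)
    top = [r for r in activated if r.confidence == best_conf]
    if len(top) == 1:                        # strategy 2: highest confidence wins
        return top[0].label
    return max(top, key=lambda r: r.coverage).label   # strategy 3: tie broken by coverage

# Hypothetical usage: subdivide a 'water' polygon into river/lake by elongation and area
rules = [
    Rule(lambda a: a["elongation"] > 5.0, "river", 0.9, 120),
    Rule(lambda a: a["area"] > 1e6, "lake", 0.8, 80),
]
print(classify({"elongation": 7.2, "area": 5e5}, rules, default_class="pond"))  # -> "river"
```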
The cloud model integrates fuzziness and randomness in a unified way via three numerical characteristics: the expected value (Ex), the entropy (En) and the hyper-entropy (He). In the discourse universe, Ex is the position corresponding to the centre of gravity of the cloud, whose elements are fully compatible with the spatial linguistic concept; En is a measure of the concept's coverage, that is, a measure of the spatial fuzziness, which indicates how many elements could be accepted by the spatial linguistic concept; and He is a measure of the dispersion of the cloud drops, which can also be considered as the entropy of En. In the extreme case {Ex, 0, 0}, the model denotes the concept of a deterministic datum, where both the entropy and the hyper-entropy equal zero. The greater the number of cloud drops, the more deterministic the concept becomes. Figure 5.2 shows the three numerical characteristics of the linguistic term “displacement is around 9 millimetres (mm)”. Given the three numerical characteristics Ex, En and He, the cloud generator can produce as many cloud drops as desired.
Fig. 5.2 Three numerical characteristics
The above three visualisation methods are all implemented with the forward cloud generator in the context of the given {Ex, En, He}. Despite the uncertainty in the algorithm, the positions of the cloud drops produced each time are deterministic: each cloud drop produced by the cloud generator is plotted deterministically according to its position. On the other hand, it is an elementary issue in spatial data mining that a spatial concept is always constructed from the given spatial data, and spatial data mining aims to discover spatial knowledge represented by a cloud from the database. That is, the backward cloud generator is also essential. It can be used to perform the transition from data to linguistic terms, and may mine the integrated {Ex, En, He} of the cloud drops specified by many precise data points. Under the umbrella of mathematics, the normal cloud model is the most common, and the functional cloud model is more interesting. Because it is common and useful for representing spatial linguistic atoms, the normal compatibility cloud will be taken as an example to study the forward and backward cloud generators in the following.
The input of the forward normal cloud generator is the three numerical characteristics of a linguistic term, (Ex, En, He), and the number of cloud drops to be generated, N, while the output is the quantitative positions of the N cloud drops in the data space and the certainty degree with which each cloud drop can represent the linguistic term. The algorithm in detail is:
• Produce a normally distributed random number En′ with mean En and standard deviation He.
• Produce a normally distributed random number xi with mean Ex and standard deviation En′.
• Calculate the certainty degree yi = exp(−(xi − Ex)² / (2En′²)).
• Drop (xi, yi) is a cloud drop in the discourse universe.
• Repeat the above steps until N cloud drops are generated.
Conversely, the input of the backward normal cloud generator is the quantitative positions of N cloud drops, xi (i = 1, …, N), and the certainty degree with which each cloud drop can represent a linguistic term, yi (i = 1, …, N), while the output is the three numerical characteristics, Ex, En and He, of the linguistic term represented by the N cloud drops.
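As an illustration, the following is a minimal sketch of the forward normal cloud generator described above. The certainty-degree formula is the standard normal-cloud form assumed in the algorithm steps, and the numerical characteristics in the example correspond to the linguistic term “displacement is around 9 mm”.

```python
import math
import random

def forward_normal_cloud(ex: float, en: float, he: float, n: int):
    """Generate n cloud drops (x, y) for the linguistic term described by (Ex, En, He)."""
    drops = []
    for _ in range(n):
        en_prime = random.gauss(en, he)           # random entropy En' ~ N(En, He)
        x = random.gauss(ex, abs(en_prime))       # drop position x ~ N(Ex, En')
        y = math.exp(-(x - ex) ** 2 / (2 * en_prime ** 2))  # certainty degree of the drop
        drops.append((x, y))
    return drops

# Example: "displacement is around 9 mm" with Ex = 9, En = 0.5, He = 0.05
drops = forward_normal_cloud(9.0, 0.5, 0.05, 1000)
print(drops[:3])
```

A backward generator would invert this process, estimating Ex from the sample mean of the drops and recovering En and He from their spread.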
With the given algorithms for the forward and backward cloud generators, it is easy to build the mapping between a qualitative concept and quantitative data, which are inseparable and interdependent. The cloud model overcomes the weaknesses of rigid specification and excessive certainty, which conflict with the human recognition process and which appear in commonly used transition models. Moreover, it performs the interchangeable transition between qualitative concept and quantitative data through strict mathematical functions, and the preservation of uncertainty in the transition makes the cloud model meet the needs of real-life situations well. Obviously, the cloud model is not a simple combination of probability methods and fuzzy methods.
Spatial data radiate energy into a data field. The power of the data field may be measured by its potential with a field function. This is similar to the way electric charges form an electric field, in which every charge affects the electric potential everywhere in the field. Therefore, the function of the data field can be derived from physical fields. The potential at a point in the number universe is the sum of the potentials contributed by all of the data.

Here, k is a radiation constant, ri is the distance from the point to the position of the ith observed datum, ρi is the certainty of the ith datum, and N is the number of data. With a higher certainty, a datum makes a greater contribution to the potential in concept space. Besides these, the spacing between neighbouring isopotential lines, the computerised grid density of the Cartesian coordinates, and so on may also make their contributions to the data field.
Here, N is the number of members of the population, so the equation above (eq. 1) gives the population mean. If z(k) is binary, depending on whether the kth member of the population belongs to a certain category or not, then the equation gives the population proportion for that attribute. In the case of a continuous population in a region A of area |A|, the sum is replaced by the integral

(1/|A|) ∫_A z(s) ds        (eq. 2)

The design-based approach is principally used for tackling ‘how much’ questions, such as estimating the above quantities. In principle, individual values z(k) could be targets of inference but, because design-based estimators disregard most of the information that is available on where the samples are located in the study area, in practice this is either not possible or gives rise to estimators with poor properties.
Design-based sample with one observation per stratum: in the absence of spatial information, the point X in stratum (k, s) would have to be estimated using the other point in stratum (k, s), even though in fact the samples in two other strata are closer and may well provide better estimates.
5.4.2 Model-based Approach to Sampling
The model-based approach or superpopulation approach to spatial sampling views the population of values in the
study region as but one realisation of some stochastic model. The source of randomness that is present in a sample
derives from a stochastic model. Again, the target of inference could be (eq.1) or (eq.2).
Under the superpopulation approach, (eq. 1), for example, now represents the mean of just one realisation. Were other realisations to be generated, the value of (eq. 1) would differ across realisations. Under this strategy, since (eq. 1) is a sum of random variables, it is itself a random variable and it is usual to speak of predicting its value. A model-based sampling strategy provides predictors that depend on model properties and are optimal with respect to the selected model. Results may be dismissed if the model is subsequently rejected or disputed.
In the model-based approach it is the mean (μ) of the stochastic model assumed to have generated the realised
population that is the usual target of inference rather than a quantity such as (eq.1). This model mean can be considered
the underlying signal of which (eq.1) is a ‘noisy’ reflection. Since μ is a (fixed) parameter of the underlying stochastic
model, if it is the target of inference, it is usual to speak of estimating its value. In spatial epidemiology for example,
it is the true underlying relative risk for an area rather than the observed or realised relative risk revealed by the
specific data set that is of interest. Another important target of inference within the model-based strategy is often z(i), the value of Z at location i. Since Z is a random variable, it is usual to speak of predicting the value z(i).
Temporal databases are databases that contain time-stamping information. Time-stamping can be done as
follows:
• With a valid time: the time during which the information is true in the real world.
• With a transaction time: the time at which the information is entered into the database.
• Bi-temporally: with both a valid time and a transaction time.
Time-stamping is usually applied to each tuple; however, it can be applied to each attribute as well. Databases that
support time can be divided into four categories:
• Snapshot databases: They keep the most recent version of the data. Conventional databases fall into this
category.
• Rollback databases: They support only the concept of transaction time.
• Historical databases: They support only valid time.
• Temporal databases: They support both valid and transaction times.
107/JNU OLE
Data Mining
The two types of temporal entities that can be stored in a database are:
• Interval: A temporal entity with a beginning time and an ending time.
• Event: A temporal entity with an occurrence time.
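To make the distinction concrete, here is a minimal sketch of how events, intervals and bitemporal facts might be represented; the class and field names are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Event:
    occurrence: date                 # a temporal entity with a single occurrence time

@dataclass
class Interval:
    begin: date                      # beginning time of the temporal entity
    end: date                        # ending time of the temporal entity

@dataclass
class BitemporalFact:
    value: str
    valid_from: date                 # valid time: when the fact becomes true in the real world
    valid_to: Optional[date]         # None means "until changed"
    tx_recorded: date                # transaction time: when the fact entered the database

# Example: a patient's address, true from 1 March, recorded on 5 March
fact = BitemporalFact("12 Oak Street", date(2011, 3, 1), None, date(2011, 3, 5))
print(fact)
```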
In addition to interval and event, another type of a temporal entity that can be stored in a database is a time series. A
time series consists of a series of real valued measurements at regular intervals. Other frequently used terms related
to temporal data are the following:
admitted to the hospital in February.” The query is submitted in natural language in the user interface, and then, in
the Temporal Mediator (TM) layer, it is converted to an SQL query. It is also the job of the Temporal Mediator to
perform temporal reasoning to find the correct beginning and end dates of the SQL query.
Finally, an event can be considered as a special case of a temporal sequence with one time-stamped element. Similarly,
a series of events is another way to denote a temporal sequence, where the elements of the sequence are of the same
types semantically, such as earthquakes, alarms and so on.
Missing data
A problem that quite often complicates time series analysis is missing data. There are several reasons for this, such
as malfunctioning equipment, human error, and environmental conditions. The handling of missing data depends on the specific application and on the amount and type of missing data. There are two approaches to dealing with missing data:
• Not filling in the missing values: For example, in the similarity computation section of this chapter, we discuss
a method that computes the similarities of two time series by comparing their local slopes. If a segment of data
is missing in one of the two time series, we simply ignore that piece in our similarity computation.
• Filling in the missing value with an estimate: For example, in the case of a time series and for small numbers of contiguous missing values, we can use data interpolation to create an estimate of the missing values, using adjacent values as a form of imputation. The greater the distance between the adjacent values used to estimate the intermediate missing values, the greater the interpolation error. The allowable interpolation error and, therefore, the interpolation distance vary from application to application. The simplest type of interpolation that gives reasonable results is linear interpolation, as sketched below.
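A minimal sketch of filling a short gap in a time series by linear interpolation, assuming equally spaced samples; the series values are hypothetical.

```python
def fill_gaps_linear(series):
    """Replace runs of None with values interpolated linearly between the nearest known neighbours."""
    filled = list(series)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            j = i
            while j < len(filled) and filled[j] is None:
                j += 1                       # find the end of the missing run
            if i > 0 and j < len(filled):    # interpolate only when both neighbours are known
                left, right = filled[i - 1], filled[j]
                step = (right - left) / (j - i + 1)
                for k in range(i, j):
                    filled[k] = left + step * (k - i + 1)
            i = j
        else:
            i += 1
    return filled

print(fill_gaps_linear([2.0, None, None, 8.0, 9.0]))   # -> [2.0, 4.0, 6.0, 8.0, 9.0]
```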
Noise removal
Noise is defined as random error that occurs in the data mining process. It can be due to several factors, such as faulty measurement equipment and environmental factors. Two common methods of dealing with noise in data mining are binning and moving-average smoothing. In binning, the data are divided into buckets or bins of equal size; the data are then smoothed by using either the mean, the median, or the boundaries of each bin, as in the sketch that follows.
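The following is a minimal sketch of smoothing by bin means, assuming the data are sorted and split into equal-sized bins; the sample values are hypothetical.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the data, split it into equal-sized bins, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_vals = ordered[start:start + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```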
5.8.1 Data Normalisation
In data normalisation, the data are scaled so that they fall within a prespecified range, such as [0, 1]. Normalisation transforms data to the same “scale” and, therefore, allows direct comparisons among their values. Two common types of normalisation are:
• Min-max normalisation: To perform this type of normalisation, we need to know the minimum (xmin) and the maximum (xmax) of the data: x′ = (x − xmin) / (xmax − xmin).
• Z-score normalisation: Here, the mean (μ) and the standard deviation (σ) of the data are used to normalise them: x′ = (x − μ) / σ.
Z-score normalisation is useful in cases where there are outliers in the data, that is, data points with extremely low or high values that are not representative of the data and could be due to measurement error.
As we can see, both types of normalisation preserve the shape of the original time series, but z-score normalisation
follows the shape more closely.
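Both normalisations can be sketched in a few lines of Python; the input series below is hypothetical, with 48 acting as an outlier.

```python
import statistics

def min_max_normalise(values):
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_normalise(values):
    """Centre values on the mean and scale by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

series = [12.0, 15.0, 14.0, 10.0, 48.0]   # 48 is an outlier
print(min_max_normalise(series))
print(z_score_normalise(series))
```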
Fig. 5.5 A Markov diagram that describes the probability of program enrolment changes
Here, we use the fact that a Markov model can be represented by a graph consisting of vertices and arcs: the vertices represent states, while the arcs represent transitions between the states.

A more specific example is shown in Figure 5.5, which models the probability of changing majors in college. We see that the probability of staying enrolled in electrical engineering is 0.6, while the probability of changing major from electrical engineering to business studies is 0.2 and the probability of switching from electrical engineering to mechanical engineering is also 0.2. As we can see, the sum of the probabilities of the arcs leaving the “Electrical engineering enrolment” state is 1.
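A minimal sketch of this Markov model as a transition-probability table, using the probabilities quoted above; the state names are shortened and only the electrical-engineering row is filled in.

```python
# Transition probabilities out of the "electrical engineering" state, as in Figure 5.5.
# Keys are current states; each inner dictionary maps next states to probabilities and must sum to 1.
transitions = {
    "electrical": {"electrical": 0.6, "business": 0.2, "mechanical": 0.2},
    # other rows (business, mechanical) would be filled in the same way
}

def transition_probability(current: str, nxt: str) -> float:
    """Probability of moving from `current` to `nxt` in one step."""
    return transitions.get(current, {}).get(nxt, 0.0)

assert abs(sum(transitions["electrical"].values()) - 1.0) < 1e-9   # row sums to 1
print(transition_probability("electrical", "business"))            # -> 0.2
```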
As an example, assume we have a new sample x of unknown class with N features, where the ith feature is denoted as xi. Then the Euclidean distance from a class sample y, whose ith feature is denoted as yi, is defined as d(x, y) = √(Σ (xi − yi)²), where the sum runs over i = 1, …, N.
K-Nearest neighbours
In this type of classifier, the domain knowledge of each class is represented by all of its samples. The new sample X, whose class is unknown, is assigned to the class that is most common among its K nearest neighbours, where K can be 1, 2, 3, and so on. Because all the training samples are stored for each class, this is a computationally expensive method.
Several important considerations about the nearest neighbour algorithm are described below (a small worked sketch follows the list):
• The algorithm's performance is affected by the choice of K. If K is small, then the algorithm can be affected by noise points. If K is too large, then the nearest neighbours can belong to many different classes.
• The choice of the distance measure can also affect the performance of the algorithm. Some distance measures
are influenced by the dimensionality of the data. For example, the Euclidean distance’s classifying power is
reduced as the number of attributes increases.
• The error of the K-NN algorithm asymptotically approaches that of the Bayes error.
• K-NN is particularly applicable to classification problems with multimodal classes.
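A minimal sketch of a K-nearest-neighbour classifier built on the Euclidean distance defined earlier; the training samples and the choice K = 3 are hypothetical.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(x, training, k=3):
    """training is a list of (feature_vector, class_label) pairs."""
    nearest = sorted(training, key=lambda sample: euclidean(x, sample[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]        # majority class among the K neighbours

training = [((1.0, 1.0), "a"), ((1.2, 0.9), "a"), ((5.0, 5.1), "b"), ((4.8, 5.3), "b")]
print(knn_classify((1.1, 1.0), training, k=3))   # -> "a"
```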
Entropy measures the amount of randomness in a data set and varies from 0 to 1. If there is no uncertainty in the data, then the entropy is 0; this happens, for example, if one value has probability 1 and the others have probability 0. If all values in the data set are equally probable, then the randomness in the data is maximised and the entropy becomes 1; in this case, the amount of information in the data set is maximised. The main idea of the ID3 algorithm is that a feature is chosen for the next level of the tree if splitting on it produces the greatest information gain.
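A minimal sketch of the entropy and information-gain computation that ID3 uses to choose a splitting feature; the tiny labelled data set is hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels (0 when one class has probability 1)."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, feature_index):
    """Reduction in entropy obtained by splitting the rows on one feature."""
    base = entropy(labels)
    splits = {}
    for row, label in zip(rows, labels):
        splits.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in splits.values())
    return base - remainder

rows = [("sunny", "hot"), ("sunny", "cool"), ("rainy", "cool"), ("rainy", "hot")]
labels = ["no", "yes", "yes", "no"]
print(information_gain(rows, labels, 0))   # gain of splitting on the first feature  -> 0.0
print(information_gain(rows, labels, 1))   # gain of splitting on the second feature -> 1.0
```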
A common training process for feedforward neural networks is back-propagation, in which we go back through the layers and modify the weights. The weight of each neuron is adjusted so that its error is reduced, where a neuron's error is the difference between its expected and actual outputs. The best-known feedforward ANN is the perceptron, which consists of only two layers (no hidden layers) and works as a binary classifier. If three or more layers exist in the ANN (at least one hidden layer), then the network is known as a multilayer perceptron.
Another type of widely used feedforward ANN is the radial-basis function ANN, which consists of three layers
and the activation function is a radial-basis function (RBF). This type of function, as the name implies, has radial
symmetry such as a Gaussian function and allows a neuron to respond to a local region of the feature space. In
other words, the activation of a neuron depends on its distance from a centre vector. In the training phase, the RBF
centres are chosen to match the training samples. Neural network classification is becoming very popular, and one of its advantages is that it is resistant to noise. The input layer consists of the attributes used in the classification, and the output nodes correspond to the classes. Regarding hidden nodes, too many nodes lead to over-fitting, while too few nodes can lead to reduced classification accuracy. Initially, each arc is assigned a random weight, which is then modified during the learning process.
The Apriori algorithm is the most extensively used algorithm for the discovery of frequent item sets and association rules. The main concepts of the Apriori algorithm are as follows:
• Any subset of a frequent item set is itself a frequent item set.
• The set of candidate item sets of size k is called Ck.
• The set of item sets in Ck that satisfy the minimum support constraint is known as Lk, the frequent item sets of size k. This is the seed set used for the next pass over the data.
113/JNU OLE
Data Mining
• Ck+1 is generated by joining Lk with itself. The item sets of each pass have one more element than the item sets of the previous pass.
• Lk+1 is then generated by eliminating from Ck+1 those item sets that do not satisfy the minimum support constraint. Because the candidates are generated by starting with the smaller item sets and progressively increasing their size, Apriori is called a breadth-first approach. An example is sketched below, where we require the minimum support to be two.
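The following is a minimal, unoptimised sketch of the candidate-generation and pruning loop described above, run on a hypothetical transaction list with a minimum support of two.

```python
def apriori(transactions, min_support=2):
    """Return every frequent item set (as a frozenset) together with its support count."""
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in sorted(items)]            # C1: all single items
    frequent = {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        lk = {c: n for c, n in counts.items() if n >= min_support}  # Lk: prune by minimum support
        frequent.update(lk)
        candidates = {a | b for a in lk for b in lk if len(a | b) == k + 1}  # Ck+1: join Lk with itself
        k += 1
    return frequent

transactions = [frozenset(t) for t in ({"bread", "milk"},
                                       {"bread", "butter"},
                                       {"bread", "milk", "butter"},
                                       {"milk", "butter"})]
for itemset, support in apriori(transactions).items():
    print(sorted(itemset), support)
```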
The algorithm has very good scalability as the size of the data increases. The GSP algorithm for sequential patterns is similar to the Apriori algorithm. It makes multiple passes over the data, as described below:
• In the first pass, it finds the frequent individual items, in other words, the length-1 sequences that have minimum support. These become the seed set for the next iteration.
• At each subsequent iteration, each candidate sequence has one more item than the seed sequences.
Two key innovations in the GSP algorithm are how candidates are generated and how candidates are counted.
Candidate Generation
The main idea here is the definition of a contiguous subsequence. Assume we are given a sequence S = {S1, S2, …, SN}; then a subsequence C is defined as a contiguous subsequence if any of the following constraints is satisfied (a checking sketch follows the list):
• C is derived from S by dropping an item from either the first or the last element of the sequence (S1 or SN).
• C is derived from S by dropping an item from an element that has at least 2 items.
• C is a contiguous subsequence of C′, and C′ is a contiguous subsequence of S.
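A minimal sketch of checking the contiguous-subsequence definition above by repeatedly applying the item-dropping rules; representing a sequence as a tuple of frozensets is an illustrative choice, not a prescribed data structure.

```python
def one_step_reductions(seq):
    """All sequences obtainable from `seq` by dropping a single item under the rules above."""
    results = []
    for pos, element in enumerate(seq):
        at_edge = pos == 0 or pos == len(seq) - 1
        if not at_edge and len(element) < 2:
            continue                          # interior elements must keep at least one item
        for item in element:
            reduced = element - {item}
            new_seq = list(seq)
            if reduced:
                new_seq[pos] = reduced
            else:
                del new_seq[pos]              # dropping the only item of an edge element removes it
            results.append(tuple(new_seq))
    return results

def is_contiguous_subsequence(c, s):
    """True if c can be reached from s by repeatedly applying the reduction rules."""
    c, s = tuple(c), tuple(s)
    if c == s:
        return True
    frontier = {s}
    while frontier:
        frontier = {r for seq in frontier for r in one_step_reductions(seq)
                    if sum(map(len, r)) >= sum(map(len, c))}
        if c in frontier:
            return True
    return False

s = (frozenset({1, 2}), frozenset({3, 4}), frozenset({5}))
print(is_contiguous_subsequence((frozenset({2}), frozenset({3, 4}), frozenset({5})), s))  # True
print(is_contiguous_subsequence((frozenset({1, 2}), frozenset({5})), s))                  # False
```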
Summary
• Technical progress in computerised data acquisition and storage results in the growth of vast databases. With the continuous increase and accumulation, the huge amounts of computerised data have far exceeded the human ability to interpret and use them completely.
• Spatial data mining and knowledge discovery (SDMKD) is the efficient extraction of hidden, implicit, interesting,
previously unknown, potentially useful, ultimately understandable, spatial or non-spatial knowledge (rules,
regularities, patterns, constraints) from incomplete, noisy, fuzzy, random and practical data in large spatial
databases.
• Cloud model is a model of the uncertainty transition between qualitative and quantitative analysis, that is a
mathematical model of the uncertainty transition between a linguistic term of a qualitative concept and its
numerical representation data.
• The design-based approach or classical sampling theory approach to spatial sampling views the population of
values in the region as a set of unknown values which are, apart from any measurement error, fixed in value.
• The model-based approach or superpopulation approach to spatial sampling views the population of values in
the study region as but one realisation of some stochastic model.
• A database model depicts the way that the database management system stores the data and manages their
relations.
• A data warehouse (DW) is a repository of data that can be used in support of business decisions. Many data
warehouses have a time dimension and therefore, they support the idea of valid time.
• Temporal constraints can be either qualitative or quantitative.
• Temporal database mediator is used to discover temporal relations, implement temporal granularity conversion,
and also discover semantic relationships.
• In classification, we assume we have some domain knowledge about the problem we are trying to solve.
• A sequence is a time-ordered list of objects, in which each object consists of an item set, with an item set
consisting of all items that appear together in a transaction.
References
• Han, J., Kamber, M. and Pei, J., 2011. Data Mining: Concepts and Techniques, 3rd ed., Elsevier.
• Mitsa, T., 2009. Temporal Data Mining, Chapman & Hall/CRC.
• Dr. Kriegel, H. P., Spatial Data Mining [Online] Available at: <http://www.dbs.informatik.uni-muenchen.de/Forschung/KDD/SpatialKDD/>. [Accessed 9 September 2011].
• Lin, W., Orgun, M. A. and Williams, G. J., An Overview of Temporal Data Mining [Online PDF] Available at: <http://togaware.redirectme.net/papers/adm02.pdf>. [Accessed 9 September 2011].
• University of Magdeburg, 2007. 3D Spatial Data Mining on Document Sets [Video Online] Available at: <http://www.youtube.com/watch?v=jJWl4Jm-yqI>. [Accessed 12 September 2011].
• Berlingerio, M., 2009. Temporal mining for interactive workflow data analysis [Video Online] Available at: <http://videolectures.net/kdd09_berlingerio_tmiwda/>. [Accessed 12 September 2011].
Recommended Reading
• Pujari, A. K., 2001. Data mining techniques, 4th ed., Universities Press.
• Stein, A., Shi, W. and Bijker, W., 2008. Quality aspects in spatial data mining, CRC Press.
• Roddick, J. F. and Hornsby, K., 2001. Temporal, spatial, and spatio-temporal data mining, Springer.
Self Assessment
1. Which of the following statements is true?
a. The technical progress in computerised data acquisition and storage results in the growth of vast web
mining.
b. In 1998, knowledge discovery in databases was further proposed.
c. Metadata are more complex, more changeable and bigger than common transactional datasets.
d. Spatial data includes not only positional data and attribute data, but also spatial relationships among spatial
entities.
2. Besides tabular data, there are vector and raster graphic data in __________.
a. metadata
b. database
c. spatial database
d. knowledge discovery
3. Which of the following is a process of discovering a form of rules plus exceptions at hierarchical view-angles with various thresholds?
a. SDMKD
b. DMKD
c. KDD
d. EM
5. Design-based approach is also called ____________.
a. classical sampling theory approach
b. model based approach
c. DMKD
d. SDMKD
6. Which of the following is a model of the uncertainty transition between qualitative and quantitative analysis?
a. Dimensional modelling
b. Cloud model
c. Web mining
d. SDMKD
Chapter VI
Application and Trends of Data Mining
Aim
The aim of this chapter is to:
Objectives
The objectives of this chapter are to:
Learning outcome
At the end of this chapter, you will be able to:
• understand the use of data mining in different sectors such as education, telecommunication, finance and so
on
6.1 Introduction
Data mining is the process of extracting interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data. It is the set of activities used to find new, hidden or unexpected patterns in data, or unusual patterns in data. Using the information contained within a data warehouse, data mining can often provide answers to questions about an organisation that a decision maker has previously not thought to ask, for example:
• Which products should be promoted to a particular customer?
• What is the probability that a certain customer will respond to a planned promotion?
• Which securities will be most profitable to buy or sell during the next trading session?
• What is the likelihood that a certain customer will default or pay back on schedule?
• What is the appropriate medical diagnosis for this patient?
These types of questions can be answered quickly if the information hidden among the petabytes of data in your databases can be located and utilised. In the following sections, we discuss the applications and trends in the field of data mining.
Example:
Generalisation of a set-valued attribute
Suppose that the expertise of a person is a set-valued attribute containing the set of values {tennis, hockey, NFS, violin, Prince of Persia}. This set can be generalised to a set of high-level concepts, such as {sports, music, computer games}, or into the number 5 (that is, the number of activities in the set). Moreover, a count can be associated with each generalised value to indicate how many elements are generalised to that value, as in {sports (2), music (1), computer games (2)}, where sports (2) indicates two kinds of sports, and so on.
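A minimal sketch of this kind of set-valued generalisation as a simple lookup; the concept hierarchy used here is a hypothetical mapping.

```python
from collections import Counter

# Hypothetical concept hierarchy mapping low-level activities to high-level concepts.
concept_of = {
    "tennis": "sports",
    "hockey": "sports",
    "NFS": "computer games",
    "Prince of Persia": "computer games",
    "violin": "music",
}

def generalise(expertise):
    """Replace each activity by its high-level concept and count how many map to each concept."""
    return Counter(concept_of[item] for item in expertise)

print(generalise({"tennis", "hockey", "NFS", "violin", "Prince of Persia"}))
# -> Counter({'sports': 2, 'computer games': 2, 'music': 1})
```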
Example:
Spatial aggregation and approximation
Suppose that we have different pieces of land for several purposes of agricultural usage, such as the planting of
vegetables, grains, and fruits. These pieces can be merged or aggregated into one large piece of agricultural land by
a spatial merge. However, such a piece of agricultural land may contain highways, houses, and small stores. If the
majority of the land is used for agriculture, the scattered regions for other purposes can be ignored, and the whole
region can be claimed as an agricultural area by approximation.
6.3.1 Spatial Data Cube Construction and Spatial OLAP
As with relational data, we can integrate spatial data to construct a data warehouse that facilitates spatial data mining.
A spatial data warehouse is a subject-oriented, integrated, time variant, and non-volatile collection of both spatial
and non-spatial data in support of spatial data mining and spatial-data-related decision-making processes.
Spatial clustering methods: Spatial data clustering identifies clusters or densely populated regions, according to
some distance measurement in a large, multidimensional data set.
Spatial classification and spatial trend analysis: Spatial classification analyses spatial objects to derive classification
schemes in relevance to certain spatial properties, such as the neighbourhood of a district, highway, or river.
Example:
Spatial classification
Suppose that you would like to classify regions in a province into rich versus poor according to the average family
income. In doing so, you would like to identify the important spatial-related factors that determine a region’s
classification. Many properties are associated with spatial objects, such as hosting a university, containing interstate
highways, being near a lake or ocean, and so on. These properties can be used for relevance analysis and to find
interesting classification schemes. Such classification schemes may be represented in the form of decision trees or
rules.
A multimedia data cube can contain additional dimensions and measures for multimedia information, such as colour,
texture, and shape.
Example:
Classification and prediction analysis of astronomy data
Taking sky images that have been carefully classified by astronomers as the training set, we can construct models
for the recognition of galaxies, stars, and other stellar objects, based on properties like magnitudes, areas, intensity,
image moments, and orientation. A large number of sky images taken by telescopes or space probes can then be
tested against the constructed models in order to identify new celestial bodies. Similar studies have successfully
been performed to identify volcanoes on Venus.
• Precision: the percentage of retrieved documents that are in fact relevant to the query (that is, “correct” responses).
• Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved.
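A minimal sketch of computing precision and recall for a single query, treating the retrieved and relevant document identifiers as hypothetical sets.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against the set of truly relevant documents."""
    hits = len(retrieved & relevant)               # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
print(precision_recall(retrieved, relevant))       # -> (0.5, 0.666...)
```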
6.6.4 Challenges
The Web seems to be too huge for effective data warehousing and data mining:
• The complexity of Web pages is far greater than that of any traditional text document collection
• The Web is a highly dynamic information source
• The Web serves a broad diversity of user communities
• Only a small portion of the information on the Web is truly relevant or useful
6.11 Data Mining for Higher Education
An important challenge that higher education faces today is predicting the paths of students and alumni. Which students will enrol in particular course programmes? Who will need additional assistance in order to graduate? Meanwhile, additional issues such as enrolment management and time-to-degree continue to exert pressure on colleges to search for new and faster solutions. Institutions can better address these students and alumni through the analysis and presentation of data. Data mining has quickly emerged as a highly desirable tool for using current reporting capabilities to uncover and understand hidden patterns in vast databases.

Data mining methods should also become more interactive and user friendly. One important direction towards improving the overall efficiency of the mining process while increasing user interaction is constraint-based mining. This provides users with more control by allowing the specification and use of constraints to guide data mining systems in their search for interesting patterns.
6.12.3 Combination of Data Mining with Database Systems, Data Warehouse Systems and Web Database Systems
Database systems, data warehouse systems and the WWW are loaded with huge amounts of data and have thus become the major information-processing systems. It is important to ensure that data mining serves as an essential data analysis component that can be smoothly integrated into such an information-processing environment. The desired architecture for a data mining system is tight coupling with database and data warehouse systems. Transaction management, query processing, online analytical processing and online analytical mining should be integrated into one unified framework.
Since data mining is a young discipline with wide and diverse applications, there is still a nontrivial gap between the general principles of data mining and domain-specific, effective data mining tools for particular applications. This chapter examines a few application domains of data mining (such as finance, the retail industry and telecommunication) and discusses trends in data mining, which include further efforts towards the exploration of new application areas, new methods for handling complex data types, algorithm scalability, constraint-based mining and visualisation methods, the integration of data mining with data warehousing and database systems, the standardisation of data mining languages, and data privacy protection and security.
data, stream data, time-series data, biological data, or Web data, or are dedicated to specific applications (such
as finance, the retail industry, or telecommunications). Moreover, many data mining companies offer customised
data mining solutions that incorporate essential data mining functions or methodologies.
• System issues: A given data mining system may run on only one operating system or on several. The most
popular operating systems that host data mining software are UNIX/Linux and Microsoft Windows. There are
also data mining systems that run on Macintosh, OS/2, and others. Large industry-oriented data mining systems
often adopt a client/server architecture, where the client could be a personal computer, and the server could be
a set of powerful parallel computers. A recent trend has data mining systems providing Web-based interfaces
and allowing XML data as input and/or output.
• Data sources: This refers to the specific data formats on which the data mining system will operate. Some
systems work only on ASCII text files, whereas many others work on relational data or data warehouse data,
accessing multiple relational data sources. It is essential that a data mining system supports ODBC connections
or OLE DB for ODBC connections. These make sure open database connections, that is, the ability to access
any relational data (including those in IBM/DB2, Microsoft SQL Server, Microsoft Access, Oracle, Sybase,
and so on), as well as formatted ASCII text data.
• Data mining functions and methodologies: Data mining functions form the core of a data mining system. Some
data mining systems provide only one data mining function, such as classification. Others may support multiple
data mining functions, such as concept description, discovery-driven OLAP analysis, association mining, linkage
analysis, statistical analysis, classification, prediction, clustering, outlier analysis, similarity search, sequential
pattern analysis, and visual data mining. For a given data mining function (such as classification), some systems
may support only one method, whereas others may support a wide variety of methods (such as decision tree
analysis, Bayesian networks, neural networks, support vector machines, rule based classification, k-nearest-
neighbour methods, genetic algorithms, and case-based reasoning). Data mining systems that support multiple
data mining functions and multiple methods per function provide the user with greater flexibility and analysis
power. Many problems may require users to try a few different mining functions or incorporate several together,
and different methods can be more effective than others for different kinds of data. In order to take advantage of
the added flexibility, however, users may require further training and experience. Thus, such systems should also
provide novice users with convenient access to the most popular function and method, or to default settings.
• Coupling data mining with database and/or data warehouse systems: A data mining system should be coupled
with a database and/or data warehouse system, where the coupled components are seamlessly integrated into a
uniform information processing environment. In general, there are four forms of such coupling: no coupling,
loose coupling, semi tight coupling, and tight coupling. Some data mining systems work only with ASCII data
files and are not coupled with database or data warehouse systems at all. Such systems have difficulties using
the data stored in database systems and handling large data sets efficiently. In data mining systems that are
loosely coupled with database and data warehouse systems, the data are retrieved into a buffer or main memory
by database or warehouse operations, and then mining functions are applied to analyse the retrieved data. These
systems may not be equipped with scalable algorithms to handle large data sets when processing data mining
queries. The coupling of a data mining system with a database or data warehouse system may be semi tight,
providing the efficient implementation of a few essential data mining primitives (such as sorting, indexing,
aggregation, histogram analysis, multiway join, and the precomputation of some statistical measures). Ideally, a
data mining system should be tightly coupled with a database system in the sense that the data mining and data
retrieval processes are integrated by optimising data mining queries deep into the iterative mining and retrieval
process. Tight coupling of data mining with OLAP-based data warehouse systems is also desirable so that data
mining and OLAP operations can be integrated to provide OLAP-mining features.
• Scalability: Data mining has two kinds of scalability issues: row (or database size) scalability and column (or dimension) scalability. A data mining system is considered row scalable if, when the number of rows is enlarged 10 times, it takes no more than 10 times as long to execute the same data mining queries. A data mining system is considered column scalable if the mining query execution time increases linearly with the number of columns (or attributes or dimensions). Due to the curse of dimensionality, it is much more challenging to make a system column scalable than row scalable.
• Visualisation tools: “A picture is worth a thousand words”—this is very true in data mining. Visualisation in
data mining can be categorised into data visualisation, mining result visualisation, mining process visualisation,
and visual data mining. The variety, quality, and flexibility of visualisation tools may strongly influence the
usability, interpretability, and attractiveness of a data mining system.
• Data mining query language and graphical user interface: Data mining is an exploratory process. An easy-to-
use and high-quality graphical user interface is necessary in order to promote user-guided, highly interactive
data mining. Most data mining systems provide user-friendly interfaces for mining. However, unlike relational
database systems, where most graphical user interfaces are constructed on top of SQL (which serves as a standard,
well-designed database query language), most data mining systems do not share any underlying data mining
query language. The lack of a standard data mining language makes it difficult to standardise data mining products and to ensure the interoperability of data mining systems. Recent efforts at defining and standardising data mining query languages include Microsoft's OLE DB for Data Mining.
These theories are not mutually exclusive. For example, pattern discovery can also be seen as a form of data reduction
or data compression. Ideally, a theoretical framework should be able to model typical data mining tasks (such as
association, classification, and clustering), have a probabilistic nature, be able to handle different forms of data, and
consider the iterative and interactive essence of data mining. Further efforts are required toward the establishment
of a well-defined framework for data mining, which satisfies these requirements.
Visual data mining can be viewed as an integration of two disciplines: data visualisation and data mining. It is also
closely related to computer graphics, multimedia systems, human computer interaction, pattern recognition, and
high-performance computing. In general, data visualisation and data mining can be integrated in the following
ways:
• Data visualisation: Data in a database or data warehouse can be viewed at different levels of granularity or
abstraction, or as different combinations of attributes or dimensions. Data can be presented in various visual
forms, such as boxplots, 3-D cubes, data distribution charts, curves, surfaces, link graphs, and so on. Visual
display can help give users a clear impression and overview of the data characteristics in a database.
• Data mining result visualisation: Visualisation of data mining results is the presentation of the results or knowledge
obtained from data mining in visual forms. Such forms may include scatter plots and boxplots (obtained from
descriptive data mining), as well as decision trees, association rules, clusters, outliers, generalised rules, and
so on.
• Data mining process visualisation: This type of visualisation presents the various processes of data mining in
visual forms so that users can see how the data are extracted and from which database or data warehouse they
are extracted, as well as how the selected data are cleaned, integrated, preprocessed, and mined. Moreover, it
may also show which method is selected for data mining, where the results are stored, and how they may be
viewed.
• Interactive visual data mining: In (interactive) visual data mining, visualisation tools can be used in the data
mining process to help users make smart data mining decisions. For example, the data distribution in a set
of attributes can be displayed using coloured sectors (where the whole space is represented by a circle). This
display helps users to determine which sector should first be selected for classification and where a good split
point for this sector may be.
Audio data mining uses audio signals to indicate the patterns of data or the features of data mining results. Although
visual data mining may disclose interesting patterns using graphical displays, it requires users to concentrate on
watching patterns and identifying interesting or novel features within them. This can sometimes be quite tiresome.
If patterns can be transformed into sound and music, then instead of watching pictures, we can listen to pitches,
rhythms, tune, and melody in order to identify anything interesting or unusual. This may relieve some of the burden
of visual concentration and be more relaxing than visual mining. Therefore, audio data mining is an interesting
complement to visual mining.
A collaborative recommender system works by finding a set of customers, referred to as neighbours, who have a history of agreeing with the target customer (that is, they tend to buy similar sets of products, or to give similar ratings for certain products). Collaborative recommender systems face two major challenges: scalability and ensuring quality recommendations to the consumer. Scalability is important, because e-commerce systems must be able to search
recommendations to the consumer. Scalability is important, because e-commerce systems must be able to search
through millions of potential neighbours in real time. If the site is using browsing patterns as indications of product
preference, it may have thousands of data points for some of its customers. Ensuring quality recommendations
is essential in order to gain consumers’ trust. If consumers follow a system recommendation but then do not end
up liking the product, they are less likely to use the recommender system again. As with classification systems,
recommender systems can make two types of errors: false negatives and false positives. Here, false negatives are
products that the system fails to recommend, although the consumer would like them. False positives are products
that are recommended, but which the consumer does not like. False positives are less desirable because they can
annoy or anger consumers.
An advantage of recommender systems is that they provide personalisation for customers of e-commerce, promoting
one-to-one marketing. Dimension reduction, association mining, clustering, and Bayesian learning are some of the
techniques that have been adapted for collaborative recommender systems. While collaborative filtering explores
the ratings of items provided by similar users, some recommender systems explore a content-based method that
provides recommendations based on the similarity of the contents contained in an item. Moreover, some systems
integrate both content-based and user-based methods to achieve further improved recommendations. Collaborative
recommender systems are a form of intelligent query answering, which consists of analysing the intent of a query
and providing generalised, neighbourhood, or associated information relevant to the query.
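As a small illustration of the neighbour-finding step described above, the sketch below scores user similarity with cosine similarity over item ratings and recommends items the closest neighbour rated highly. The ratings matrix is hypothetical and cosine similarity is just one of several reasonable choices.

```python
import math

ratings = {   # hypothetical user -> {item: rating} matrix
    "alice": {"book": 5, "film": 3, "game": 4},
    "bob":   {"book": 4, "film": 3, "game": 5, "album": 4},
    "carol": {"film": 1, "album": 5},
}

def cosine(u, v):
    """Cosine similarity between two users' rating vectors (unrated items count as 0)."""
    items = set(u) | set(v)
    dot = sum(u.get(i, 0) * v.get(i, 0) for i in items)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(target, min_rating=4):
    """Suggest items the most similar neighbour liked but the target has not yet rated."""
    others = {name: r for name, r in ratings.items() if name != target}
    neighbour = max(others, key=lambda name: cosine(ratings[target], ratings[name]))
    return [item for item, score in ratings[neighbour].items()
            if score >= min_rating and item not in ratings[target]]

print(recommend("alice"))   # -> ['album'], because bob is alice's closest neighbour
```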
Summary
• Data mining is the process of extraction of interesting patterns or knowledge from huge amount of data. It is the
set of activities used to find new, hidden or unexpected patterns in data or unusual patterns in data.
• An important feature of object-relational and object-oriented databases is their capability of storing, accessing, and
modelling complex structure-valued data, such as set- and list-valued data and data with nested structures.
• Generalisation on a class composition hierarchy can be viewed as generalisation on a set of nested structured
data.
• A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or
medical imaging data, and VLSI chip layout data. Spatial data mining refers to the extraction of knowledge,
spatial relationships, or other interesting patterns not explicitly stored in spatial databases.
• Spatial database systems usually handle vector data that consist of points, lines, polygons (regions), and their
compositions, such as networks or partitions.
• Information retrieval (IR), which is concerned with text data analysis, is a field that has been developing in parallel with database systems for many years.
• Once an inverted index is created for a document collection, a retrieval system can answer a keyword query
quickly by looking up which documents contain the query keywords.
• The World Wide Web serves as a huge, widely distributed, global information service centre for news,
advertisements, consumer information, financial management, education, government, e-commerce, and many
other information services.
• Recent research in DNA analysis has led to the discovery of genetic causes for many diseases and disabilities
as well as approaches for disease diagnosis, prevention and treatment.
References
• Han, J., Kamber, M. and Pei, J., 2011. Data Mining: Concepts and Techniques, 3rd ed., Elsevier.
• Alexander, D., Data Mining [Online] Available at: <http://www.laits.utexas.edu/~norman/BUS.FOR/course.
mat/Alex/>. [Accessed 9 September 2011].
• Galeas, Web Mining [Online] Available at: <http://www.galeas.de/webmining.html>. [Accessed 12 September 2011].
• Springerlink, 2006. Data Mining System Products and Research Prototypes [Online PDF] Available at: <http://
www.springerlink.com/content/2432076500506017/>. [Accessed 12 September 2011].
• Dr. Kuonen, D., 2009. Data Mining Applications in Pharma/BioPharma Product Development [Video Online]
Available at: <http://www.youtube.com/watch?v=kkRPW5wSwNc>. [Accessed 12 September 2011].
• SalientMgmtCompany, 2011. Salient Visual Data Mining [Video Online] Available at: <http://www.youtube.com/watch?v=fosnA_vTU0g>. [Accessed 12 September 2011].
Recommended Reading
• Liu, B., 2011. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 2nd ed., Springer.
• Scime, A., 2005. Web mining: applications and techniques, Idea Group Inc. (IGI).
• Markov, Z. and Larose, D. T., 2007. Data mining the Web: uncovering patterns in Web content, structure, and
usage, Wiley-Interscience.
Self Assessment
1. Using information contained within__________, data mining can often provide answers to questions about an
organisation that a decision maker has previously not thought to ask.
a. metadata
b. web mining
c. data warehouse
d. data extraction
3. Which of the following stores a large amount of space-related data, such as maps, preprocessed remote sensing
or medical imaging data, and VLSI chip layout data?
a. Knowledge database
b. Spatial database
c. Data mining
d. Data warehouse
4. ___________ systems usually handle vector data that consist of points, lines, polygons and their compositions,
such as networks or partitions.
a. Knowledge database
b. Data mining
c. Data warehouse
d. Spatial database
6. Which of the following can contain additional dimensions and measures for multimedia information, such as
colour, texture, and shape?
a. Multimedia data cube
b. Multimedia data mining
c. Multimedia database
d. Spatial database
7. What is used for mining multimedia data, especially in scientific research, such as astronomy, seismology, and
geoscientific research?
a. Data warehousing
b. Classification and predictive modelling
c. Data extraction
d. Dimensional modelling
8. _________ is the percentage of retrieved documents that are in fact relevant to the query.
a. Recall
b. Precision
c. Text mining
d. Information retrieval
10. According to which of the following theories is the basis of data mining to reduce the data representation?
a. Data reduction
b. Data compression
c. Pattern discovery
d. Probability theory
Chapter VII
Implementation and Maintenance
Aim
The aim of this chapter is to:
Objectives
The objectives of this chapter are to:
Learning outcome
At the end of this chapter, you will be able to:
7.1 Introduction
As you know, in an OLTP system you have to perform a number of tasks to complete the physical model. The logical model forms the primary basis for the physical model, but a number of additional factors must be considered before you get to the physical model. You must determine where to place the database objects in physical storage. What is the storage medium and what are its features? This information helps you to define the storage parameters. Later you need to plan for indexing, an important consideration being on which columns in each table the indexes must be built. You also need to look into other methods for improving performance, and you have to examine the initialisation parameters in the DBMS and decide how to set them. Similarly, in the data warehouse environment, you need to consider many different factors to complete the physical model.
[Figure: Steps for completing the physical model: develop standards, create the aggregates plan, determine data partitioning, establish clustering options, prepare the indexing strategy, assign storage structures, and complete the physical model.]
7.2.2 Create Aggregates Plan
If your data warehouse stores data only at the lowest level of granularity, every summary query has to read through all the detailed records and sum them up. Consider a query looking for total sales for the year, by product, for all the stores. If you have detailed records keeping sales by individual calendar dates, by product, and by store, then this query needs to read a large number of detailed records. So what is the best method to improve performance in such cases? If you have higher-level summary tables of products by store, the query could run faster. But how many such summary tables must you create? What is the limit?
In this step, review the possibilities for building aggregate tables. You get clues from the requirements definition.
Look at each dimension table and examine the hierarchical levels. Which of these levels are more important for
aggregation? Clearly assess the tradeoff. What you need is a comprehensive plan for aggregation. The plan must
spell out the exact types of aggregates you must build for each level of summarisation. It is possible that many of
the aggregates will be present in the OLAP system. If OLAP instances are not for universal use by all users, then
the necessary aggregates must be present in the main warehouse. The aggregate database tables must be laid out
and included in the physical model.
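To illustrate the kind of pre-computed summary such a plan might call for, the following minimal sketch (Python with pandas; the table and column names are hypothetical) rolls a detailed daily sales table up to a product-by-store-by-month aggregate. The actual aggregates you build should follow your own requirements definition.

import pandas as pd

# Hypothetical detailed fact table: one row per product, store and calendar date.
detail = pd.DataFrame({
    "sale_date": pd.to_datetime(["2013-01-05", "2013-01-17", "2013-02-02", "2013-02-02"]),
    "product": ["P100", "P100", "P200", "P100"],
    "store": ["S01", "S02", "S01", "S01"],
    "sales_amt": [120.0, 75.5, 200.0, 60.0],
})

# Aggregate table at the product / store / month level of summarisation.
detail["sale_month"] = detail["sale_date"].dt.to_period("M")
agg_product_store_month = detail.groupby(
    ["product", "store", "sale_month"], as_index=False
)["sales_amt"].sum()

# A total-sales-by-product-by-store query for the year now reads this small
# aggregate table instead of every detailed record.
print(agg_product_store_month)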
Partitioning divides large database tables into manageable parts. Always consider partitioning options for fact
tables. It is not just the decision to partition that counts. Based on your environment, the real decision is about
how exactly to partition the fact tables. Your data warehouse may be a conglomerate of conformed data marts. You
must consider partitioning options for each fact table. You may find that some of your dimension tables are also
candidates for partitioning. Product dimension tables are especially large. Examine each of your dimension tables
and determine which of these must be partitioned. In this step, come up with a definite partitioning scheme. The
scheme must include:
• The fact tables and the dimension tables selected for partitioning
• The type of partitioning for each table—horizontal or vertical
• The number of partitions for each table
• The criteria for dividing each table (for example, by product groups)
• Description of how to make queries aware of partitions
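As a minimal sketch of horizontal partitioning (Python with pandas; the file and column names are hypothetical), the fragment below splits a fact table into one physical file per month, so that a query restricted to one month reads only that partition.

import pandas as pd

# Hypothetical sales fact table to be partitioned horizontally by calendar month.
fact = pd.DataFrame({
    "sale_date": pd.to_datetime(["2013-01-05", "2013-01-17", "2013-02-02"]),
    "product": ["P100", "P100", "P200"],
    "store": ["S01", "S02", "S01"],
    "sales_amt": [120.0, 75.5, 200.0],
})

# Horizontal partitioning: rows are divided by a date criterion, one file per partition.
for month, part in fact.groupby(fact["sale_date"].dt.to_period("M")):
    part.to_csv(f"sales_fact_{month}.csv", index=False)

# A January-only query opens just the January partition and leaves the rest untouched.
january = pd.read_csv("sales_fact_2013-01.csv", parse_dates=["sale_date"])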
For two related tables, you may want to store the records interleaved: a record from one table is followed by all of its related records from the other table, with both stored in the same file.
What are the various physical data structures in the storage area? What is the storage medium and what are its
characteristics? Do the features of the medium support any efficient storage or retrieval techniques? We will explore
answers to questions such as these. From the answers you will derive methods for improving performance.
The following figure shows the physical data structures in the data warehouse. Observe the different levels of data and notice the detail and summary data structures. Think further about how the data structures are implemented in physical storage as files, blocks, and records.
[Figure: Physical data structures in the data warehouse: partitioned physical files holding the relational database data and index files for the warehouse data (detailed data and light summaries), data extract flat files and relational database data files for the transformed data, and physical files in a proprietary matrix format storing multidimensional cubes of data.]
Remember, any optimising at the physical level is tied to the features and functions available in the DBMS. You
have to relate the techniques discussed here with the workings of your DBMS. Please study the following optimising
techniques.
With a larger block size, more records or rows fit into a single block. Because more records may be fetched in one read, larger block sizes
decrease the number of reads. Another advantage relates to space utilisation by the block headers. As a percentage
of the space in a block, the block header occupies less space in a larger block. Therefore, overall, all the block
headers put together occupy less space. But here is the downside of larger block sizes. Even when a smaller number
of records are needed, the operating system reads too much extra information into memory, thereby impacting
memory management.
However, because most data warehouse queries request large numbers of rows, memory management as indicated
rarely poses a problem. There is another aspect of data warehouse tables that could cause some concern. Data
warehouse tables are denormalised and therefore, the records tend to be large. Sometimes, a record may be too
large to fit in a single block. Then the record has to be split across more than one block. The broken parts have to be
connected with pointers or physical addresses. Such pointer chains affect performance to a large extent. Consider
all the factors and set the block size at the appropriate size. Generally, increased block size gives better performance
but you have to find the proper size.
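A rough back-of-the-envelope calculation with made-up numbers shows the effect. The sketch below (Python) compares how many reads a scan of one million 400-byte rows needs under two assumed block sizes, and how much of each block the header consumes.

# Hypothetical figures: 1,000,000 fact rows of ~400 bytes each and a ~100-byte block header.
ROW_BYTES, HEADER_BYTES, ROWS_SCANNED = 400, 100, 1_000_000

for block_bytes in (4096, 16384):
    rows_per_block = (block_bytes - HEADER_BYTES) // ROW_BYTES
    reads_needed = -(-ROWS_SCANNED // rows_per_block)   # ceiling division
    header_overhead = HEADER_BYTES / block_bytes * 100
    print(f"{block_bytes}-byte blocks: {rows_per_block} rows per block, "
          f"{reads_needed} reads, header overhead {header_overhead:.1f}%")

With these assumed numbers, the 16 KB block needs roughly a quarter of the reads of the 4 KB block, at the cost of pulling more unneeded rows into memory when only a few records are wanted.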
• Disk mirroring—writing the same data to two disk drives connected to the same controller
• Disk duplexing—similar to mirroring, except here each drive has its own distinct controller
• Parity checking—addition of a parity bit to the data to ensure correct data transmission
• Disk striping—data spread across multiple disks by sectors or bytes
RAID is implemented at six different levels: RAID 0 through RAID 5. Please turn to Figure 7.3, which gives you
a brief description of RAID. Note the advantages and disadvantages. The lowest level configuration RAID 0 will
provide data striping. At the other end of the range, RAID 5 is a very valuable arrangement.
[Figure 7.3: RAID levels with their advantages and disadvantages, for example high performance for large blocks of data but no guaranteed on-the-fly recovery, handling of multiple I/Os from a sophisticated operating system with only two drives, and arrangements in which a dedicated parity drive is unnecessary and two or more drives suffice at the cost of poor write performance.]
Estimate the storage sizes for:
• Temporary work space for sorting, merging
• Temporary files in the staging area
• Permanent files in the staging area
What types of indexes must you build in your data warehouse? The DBMS vendors offer a variety of choices. The
choice is no longer confined to sequential index files. All vendors support B-Tree indexes for efficient data retrieval.
Another option is the bitmapped index. As we will see later in this section, this indexing technique is very appropriate
for the data warehouse environment. Some vendors are extending the power of indexing to specific requirements.
These include indexes on partitioned tables and index-organised tables.
If a column in a table has many unique values, then the selectivity of the column is said to be high. In a territory
dimension table, the column for City contains many unique values. This column is therefore highly selective. B-Tree
indexes are most suitable for highly selective columns. Because the values at the leaf nodes will be unique they will
lead to distinct data rows and not to a chain of rows. What if a single column is not highly selective?
[Figure: A B-Tree index: the root node splits the key range into A-K and L-Z, which branch into the leaf ranges A-D, E-G, H-K and L-O, P-R, S-Z.]
Indexes grow in direct proportion to the growth of the indexed data table. Wherever indexes contain concatenation of
multiple columns, they tend to sharply increase in size. As the data warehouse deals with large volumes of data, the
size of the index files can be cause for concern. What can we say about the selectivity of the data in the warehouse?
Are most of the columns highly selective? Not really. If you inspect the columns in the dimension tables, you will
notice a number of columns that contain low-selectivity data. B-Tree indexes do not work well with data whose
selectivity is low. What is the alternative? That leads us to another type of indexing technique.
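To make the bitmapped indexing idea concrete, here is a minimal, hypothetical sketch in Python: one bit vector is kept per distinct value of a low-selectivity column, and a query predicate becomes a bitwise operation over those vectors.

# Hypothetical dimension rows with a low-selectivity "region" column.
rows = ["North", "South", "North", "East", "South", "North"]

# Build one bit vector per distinct value: bit i is set when row i holds that value.
bitmaps = {}
for i, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

# A predicate such as region IN ('North', 'East') becomes a bitwise OR of two vectors.
matches = bitmaps["North"] | bitmaps["East"]
matching_rows = [i for i in range(len(rows)) if matches & (1 << i)]
print(matching_rows)   # [0, 2, 3, 5]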
Please study the following tips and use them when planning to create indexes for the fact tables:
• If the DBMS does not create an index on the primary key, deliberately create a B-Tree index on the full primary
key.
• Carefully design the order of individual key elements in the full concatenated key for indexing. In the high order
of the concatenated key, place the keys of the dimension tables frequently referred to while querying.
• Review the individual components of the concatenated key. Create indexes on combinations of them based on query processing requirements.
• If the DBMS supports intelligent combinations of indexes for access, then you may create indexes on each
individual component of the concatenated key.
• Do not overlook the possibilities of indexing the columns containing the metrics. For example, if many queries
look for dollar sales within given ranges, then the column “dollar sales” is a candidate for indexing.
• Bitmapped indexing does not apply to fact tables. There are hardly any low-selectivity columns.
Queries also run longer when attempting to sort through large volumes of data to obtain the result sets. Backing up
and recovery of huge tables takes an inordinately long time. Again, when you want to selectively purge and archive
records from a large table, wading through all the rows takes a long time.
Performing maintenance operations on smaller pieces is easier and faster. Partitioning is a crucial decision and must
be planned up front. Doing this after the data warehouse is deployed and goes into production is time-consuming
and difficult. Partitioning means deliberate splitting of a table and its index data into manageable parts. The DBMS
supports and provides the mechanism for partitioning. When you define the table, you can define the partitions as
well. Each partition of a table is treated as a separate object. As the volume increases in one partition, you can split
that partition further. The partitions are spread across multiple disks to gain optimum performance. Each partition
in a table may have distinct physical attributes, but all partitions of the table have the same logical attributes.
As you observe, partitioning is an effective technique for storage management and improving performance. The
benefits are as follows.
• A query needs to access only the necessary partitions. Applications can be given the choice to have partition
transparency or they may explicitly request an individual partition. Queries run faster when accessing smaller
amounts of data.
• An entire partition may be taken off-line for maintenance. You can separately schedule maintenance of partitions.
Partitions promote concurrent maintenance operations.
• Index building is faster.
• Loading data into the data warehouse is easy and manageable.
• Data corruption affects only a single partition. Backup and recovery on a single partition reduces downtime.
• The input–output load gets balanced by mapping different partitions to the various disk drives.
In addition, rolling summary structures are especially useful in a data warehouse. Suppose your data warehouse needs to keep hourly data, daily data, weekly data, and monthly summaries; create mechanisms to roll the data into the next higher level automatically with the passage of time. Hourly data automatically gets summarised into the daily data, daily data into the weekly data, and so on.
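A minimal sketch of the idea, with hypothetical column names (Python with pandas): hourly rows are summed into daily, weekly, and monthly summary tables, which a scheduled job would append to the rolling structures while purging detailed rows that have aged out.

import pandas as pd

# Hypothetical hourly measurements to be rolled up as time passes.
hourly = pd.DataFrame({
    "ts": pd.date_range("2013-03-01", periods=48, freq="H"),
    "amount": range(48),
})

# Roll hourly data into daily, weekly, and monthly summaries.
daily = hourly.groupby(hourly["ts"].dt.date)["amount"].sum()
weekly = hourly.groupby(hourly["ts"].dt.to_period("W"))["amount"].sum()
monthly = hourly.groupby(hourly["ts"].dt.to_period("M"))["amount"].sum()

print(daily)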
Although creating arrays is a clear violation of normalisation principles, this technique yields tremendous performance improvement. In the data warehouse, the time element is interwoven into all data, and users frequently look for data in a time series. A typical example is the request for monthly sales figures for 24 months for each salesperson. If you analyse the common queries, you will be surprised to see how many need data that can be readily stored in arrays.
It is never enough to simply deploy a solution and then ‘leave’. Ongoing maintenance and future enhancements must be managed, and a programme of user training is often required, apart from the logistics of the deployment itself. Timing of a deployment is critical: allow too much time and you risk missing your deadlines, while allowing too little time runs you into resourcing problems. As with most IT work, never underestimate the amount of work involved or the amount of time required.
The data warehouse might need to be customised for deployment to a particular country or location, where they
might use the general design but have their own data needs. It is not uncommon for different parts of the same
organisation to use different computer systems, particularly where mergers and acquisitions are involved, so the
data warehouse must be modified to allow this as part of the deployment.
Roll-out to production
This takes place after user acceptance testing (UAT) and includes: moving the data warehouse to the live servers,
loading all the live data – not just some of it for testing purposes, optimising the databases and implementing security.
All of this must involve minimum disruption to the system users. Needless to say, you need to be very confident
everything is in place and working before going live – or you might find you have to do it all over again.
Scheduling jobs
In a production environment jobs such as data warehouse loading must be automated in scripts and scheduled to
run automatically. A suitable time slot must be found that does not conflict with other tasks happening on the same
servers. Procedures must be in place to deal with unexpected events and failures.
Regression testing
This type of testing is part of the deployment process and searches for errors that were fixed at one point but have somehow been reintroduced by the change in environment.
• Deployment
Install the Analytics reporting and the ETL tools.
Specific Setup and Configuration for OLTP, ETL, and data warehouse.
Sizing of the system and database
Performance Tuning and Optimisation
• Management and Maintenance of the system
Ongoing support of the end-users, including security, training, and enhancing the system.
You need to monitor the growth of the data.
Immediately following the initial deployment, the project team must conduct review sessions. Here are the major
review tasks:
• Review the testing process and suggest recommendations.
• Review the goals and accomplishments of the pilots.
• Survey the methods used in the initial training sessions.
• Document highlights of the development process.
• Verify the results of the initial deployment, matching these with user expectations.
The review sessions and their outcomes form the basis for improvement in further releases of the data warehouse. As you expand and produce further releases, let business needs, modelling considerations, and infrastructure factors remain the guiding factors for growth. Schedule each release close to the previous one, so you can make use of the data modelling done in the earlier release. Build each release as a logical next step, avoid disconnected releases, and build on the current infrastructure.
The following figure presents the data warehouse monitoring activity and its usefulness. As you can observe, statistics serve as the life-blood of the monitoring activity, which in turn leads into growth planning and fine-tuning of the data warehouse.
[Figure: Data warehouse monitoring: statistics gathered from the warehouse data feed the monitoring activity, and the statistics are reviewed for growth planning and performance tuning.]
The tools that come with the database server and the host operating system are generally turned on to collect the monitoring statistics. Over and above these, many third-party vendors supply tools especially useful in a data warehouse environment. Most tools gather the values for the indicators and also interpret the results. The data collector component collects the statistics while the analyser component does the interpretation. Most of the monitoring of the system occurs in real time.
The following is a random list that includes statistics for different uses. You will find most of these applicable to
your environment.
• Physical disk storage space utilisation
• Number of times the DBMS is looking for space in blocks or causes fragmentation
• Memory buffer activity
• Buffer cache usage
• Input–output performance
• Memory management
• Profile of the warehouse content, giving number of distinct entity occurrences (example: number of customers,
products, and so on)
• Size of each database table
• Accesses to fact table records
• Usage statistics relating to subject areas
• Numbers of completed queries by time slots during the day
• Time each user stays online with the data warehouse
• Total number of distinct users per day
• Maximum number of users during time slots daily
• Duration of daily incremental loads
• Count of valid users
• Query response times
• Number of reports run each day
• Number of active tables in the database
7.7.5 Publishing Trends for Users
This is a new concept not usually found in OLTP systems. In a data warehouse, the users must find their way into
the system and retrieve the information by themselves. They must know about the contents. Users must know about
the currency of the data in the warehouse. When was the last incremental load? What are the subject areas? What
is the count of distinct entities? The OLTP systems are quite different. These systems readily present the users with
routine and standardised information. Users of OLTP systems do not need the inside view. Look at the following
figure listing the types of statistics that must be published for the users. If your data warehouse is Web-enabled, use the company’s intranet to publish the statistics for the users. Otherwise, provide the ability to inquire into the dataset where the statistics are kept.
[Figure: Publishing statistics for users: a Web-enabled data warehouse publishes warehouse data statistics and information, drawn from the metadata, on an intranet Web page.]
• Ensure ability to shift data from bad storage sectors.
• Look for storage systems with diagnostics to prevent outages.
There may not be any point in repeating the indexing and other techniques that you already know from the OLTP
environment. Following are a few practical suggestions:
• Have a regular schedule to review the usage of indexes. Drop the indexes that are no longer used.
• Monitor query performance daily. Investigate long-running queries. Work with the user groups that seem to be
executing long-running queries. Create indexes if needed.
• Analyse the execution of all predefined queries on a regular basis. RDBMSs have query analysers for this
purpose.
• Review the load distribution at different times per day. Determine the reasons for large variations.
• Although you have instituted a regular schedule for ongoing fine-tuning, from time to time, you will come
across some queries that suddenly cause grief. You will hear complaints from a specific group of users. Be
prepared for such ad hoc fine-tuning needs. The data administration team must have staff set apart for dealing
with these situations.
Response models
The best method for identifying the customers or prospects to target for a specific product offering is through the
use of a model developed specifically to predict response. These models are used to identify the customers most
likely to exhibit the behaviour being targeted. Predictive response models allow organisations to find the patterns
that separate their customer base so the organisation can contact those customers or prospects most likely to take
the desired action. These models contribute to more effective marketing by ranking the best candidates for a specific
product offering thus identifying the low hanging fruit.
These models use a scoring algorithm specifically calibrated to select revenue-producing customers and help identify
the key characteristics that best identify better customers. They can be used to fine-tune standard response models
or used in acquisition strategies.
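As a hedged illustration of the scoring idea (not any particular vendor's algorithm), the sketch below trains a simple logistic response model on hypothetical customer attributes and ranks customers by their predicted probability of responding, assuming Python with scikit-learn and NumPy is available.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical customer attributes: age, average purchase value, offers in past 6 months.
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(18, 80, n),
    rng.normal(40.0, 15.0, n),
    rng.integers(0, 6, n),
])
# Hypothetical historical outcome: 1 = responded to a past offer, 0 = did not.
y = (rng.random(n) < 0.15).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Score the customer base and contact the most likely responders first.
scores = model.predict_proba(X)[:, 1]
ranked = np.argsort(scores)[::-1]
print("Top 10 candidates for the offer:", ranked[:10])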
Cross-sell and up-sell models
Cross-sell/up-sell models identify customers who are the best prospects for the purchase of additional products and
services and for upgrading their existing products and services. The goal is to increase share of wallet. Revenue can
increase immediately, but loyalty is enhanced as well due to increased customer involvement.
Attrition models
Efficient, effective retention programs are critical in today’s competitive environment. While it is true that it is less
costly to retain an existing customer than to acquire a new one, the fact is that all customers are not created equal.
Attrition models enable you to identify customers who are likely to churn or switch to other providers thus allowing
you to take appropriate pre-emptive action. When planning retention programs, it is essential to be able to identify your best customers, optimise existing customers, and build loyalty through “entanglement”. Attrition
models are best employed when there are specific actions that the client can take to retard cancellation or cause
the customer to become substantially more committed. The modelling technique provides an effective method for
companies to identify characteristics of churners for acquisition efforts and also to prevent or forestall cancellation
of customers.
eNuggets is a revolutionary new business intelligence tool that can be used for web personalisation or other real
time business intelligence purposes. It can be easily integrated with existing systems such as CRM, outbound
telemarketing (that is intelligent scripting), insurance underwriting, stock forecasting, fraud detection, genetic
research and many others.
eNuggets™ uses historical data (either from company transaction data or from outside data) to extract information in the form of English rules understandable by humans. The rules collectively form a model of patterns in the data that would not be evident to human analysis. When new data comes in, such as a stock transaction from ticker data, eNuggets™ interrogates the model and finds the most appropriate rule to suggest which course of action will provide the best result (that is, buy, sell or hold).
Summary
• The logical model forms the primary basis for the physical model.
• Many companies invest a lot of time and money to prescribe standards for information systems. The standards
range from how to name the fields in the database to how to conduct interviews with the user departments for
requirements definition.
• Standards take on greater importance in the data warehouse environment.
• If the data warehouse stores data only at the lowest level of granularity, every such query has to read through
all the detailed records and sum them up.
• If OLAP instances are not for universal use by all users, then the necessary aggregates must be present in the
main warehouse. The aggregate database tables must be laid out and included in the physical model.
• During the load process, the entire table must be closed to the users.
• In the data warehouse, many of the data access patterns rely on sequential access of large quantities of data.
• Preparing an indexing strategy is a crucial step in the physical design. Unlike OLTP systems, the data warehouse
is query-centric.
• The efficiency of the data retrieval is closely tied to where the data is stored in physical storage and how it is
stored there.
• Most of the leading DBMSs allow you to set block usage parameters at appropriate values and derive performance
improvement.
• Redundant array of inexpensive disks (RAID) technology has become common to the extent that almost all of
today’s data warehouses make good use of this technology.
• In a query-centric system like the data warehouse environment, the need to process queries faster dominates.
• Bitmapped indexes are ideally suitable for low-selectivity data.
• Once the data warehouse is designed, built and tested, it needs to be deployed so it is available to the user
community.
• Data warehouse management is concerned with two principal functions: maintenance management and change
management.
References
• Ponniah, P., 2001. Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, Wiley-Interscience Publication.
• Larose, D. T., 2006. Data mining methods and models, John Wiley and Sons.
• Wan, D., 2007. Typical data warehouse deployment lifecycle [Online] Available at: <http://dylanwan.wordpress.
com/2007/11/02/typical-data-warehouse-deployment-lifecycle/>. [Accessed 12 September 2011].
• Statsoft, Data Mining Techniques [Online] Available at: <http://www.statsoft.com/textbook/data-mining-
techniques/>. [Accessed 12 September 2011].
• StatSoft, 2010. Data Mining, Model Deployment and Scoring - Session 30 [Video Online] Available at: <http://www.youtube.com/watch?v=LDoQVbWpgKY>. [Accessed 12 September 2011].
• OracleVideo, 2010. Data Warehousing Best Practices Star Schemas [Video Online] Available at: <http://www.youtube.com/watch?v=LfehTEyglrQ>. [Accessed 12 September 2011].
Recommended Reading
• Kantardzic, M., 2001. Data Mining: Concepts, Models, Methods, and Algorithms, 2nd ed., Wiley-IEEE.
• Khan, A., 2003. Data Warehousing 101: Concepts and Implementation, iUniverse.
• Rainardi, V., 2007. Building a data warehouse with examples in SQL Server, Apress.
Self Assessment
1. The logical model forms the __________ basis for the physical model.
a. primary
b. secondary
c. important
d. former
2. If _________ instances are not for universal use by all users, then the necessary aggregates must be present in
the main warehouse.
a. KDD
b. DMKD
c. OLAP
d. SDMKD
6. In most cases, the supporting proprietary software dictates the storage and the retrieval of data in the ________
system.
a. KDD
b. OLTP
c. SDMKD
d. OLAP
10. In a __________ system like the data warehouse environment, the need to process queries faster dominates.
a. OLAP
b. query-centric
c. OLTP
d. B-Tree index
Case study I
Logic-ITA student data
We have performed a number of queries on datasets collected by the Logic-ITA to assist teaching and learning.
The Logic-ITA is a web-based tutoring tool used at Sydney University since 2001, in a course taught by the second
author. Its purpose is to help students practice logic formal proofs and to inform the teacher of the class progress.
Context of use
Over the four years, around 860 students attended the course and used the tool, in which an exercise consists of a set
of formulas (called premises) and another formula (called the conclusion). The aim is to prove that the conclusion can
validly be derived from the premises. For this, the student has to construct new formulas, step by step, using logic
rules and formulas previously established in the proof, until the conclusion is derived. There is no unique solution
and any valid path is acceptable. Steps are checked on the fly and, if incorrect, an error message and possibly a tip
are displayed. Students used the tool at their own discretion. A consequence is that there is neither a fixed number
nor a fixed set of exercises done by all students.
Data stored
The tool’s teacher module collates all the student models into a database that the teacher can query and mine. Two
often queried tables of the database are the tables mistake and correct step. The most common variables are shown
in Table 1.1.
(Source: Merceron, A. and Yacef, K., Educational Data Mining: a Case Study)
Questions
1. What is the Logic-ITA? What is its purpose?
Answer
The Logic-ITA is a web-based tutoring tool used at Sydney University since 2001, in a course taught by the
second author. Its purpose is to help students practice logic formal proofs and to inform the teacher of the class
progress.
Case Study II
A Case Study of Exploiting Data Mining Techniques for an Industrial Recommender System
In this case study, we aim to provide recommendations to the loyal customers of a chain of fashion retail stores
based in Spain. In particular, the retail stores would like to be able to generate targeted product recommendations
to loyal customers based on either customer demographics, customer transaction history, or item properties. A
comprehensive description of the available dataset with the above information is provided in the next subsection. The
transformation of this dataset into a format that can be exploited by Data Mining and Machine Learning techniques
is described in sections below.
Dataset
The dataset used for this case study contained data on customer demographics, transactions performed, and item
properties. The entire dataset covers the period of 01/01/2007 – 31/12/2007.
There were 1,794,664 purchase transactions by both loyal and non-loyal customers. The average value of a purchased
item was €35.69. We removed the transactions performed by non-loyal customers, which reduced the number of
purchase transactions to 387,903 by potentially 357,724 customers. We refer to this dataset as Loyal. The average
price of a purchased item was €37.81.
We then proceeded to remove all purchased items with a value of less than €0 because these represent refunds. This
reduced the number of purchase transactions to 208,481 by potentially 289,027 customers. We refer to this dataset
as Loyal-100.
Dataset Processing
We processed the Loyal dataset to remove incomplete data for the demographic, item, and purchase transaction
attributes.
Demographic Attributes
Table 2.1 shows the four demographic attributes we used for this case study. The average item price attribute was
not contained in the database; it was derived from the data.
The date of birth attribute was provided in seven different valid formats, alongside several invalid formats. The invalid formats resulted in 17,125 users being removed from the Loyal dataset. The date of birth was further processed to produce the age of the user in years. We considered an age of less than 18 to be invalid because of the requirement for a loyal customer to be 18 years old to join the scheme; we also considered an age of more than 80 to be unusually old based on the life expectancy of a Spanish person. Customers with an age outside the 18–80 range were removed from the dataset. Customers without a gender, or with a Not Applicable gender, were removed
from the Loyal-100 dataset. Finally, users who did not perform at least one transaction between 01/01/2007 and
31/12/2007 were removed from the dataset. An overview of the number of customers removed from the Loyal-100
dataset can be seen in Table 2.2.
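A hedged sketch of this kind of demographic cleaning, with hypothetical column names (Python with pandas): derive age from the parsed date of birth, then drop customers outside the 18–80 range or without a usable gender.

import pandas as pd

# Hypothetical loyal-customer rows after parsing the date-of-birth formats.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "birth_date": pd.to_datetime(["1975-04-02", "2001-11-30", "1925-06-15", "1970-01-20"]),
    "gender": ["F", "M", None, "M"],
})

reference_date = pd.Timestamp("2007-12-31")   # end of the dataset period
customers["age"] = (reference_date - customers["birth_date"]).dt.days // 365

cleaned = customers[
    customers["age"].between(18, 80) & customers["gender"].isin(["F", "M"])
].copy()
print(cleaned)   # keeps customers 1 and 4 in this toy example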
Item Attributes
Table 2.3 presents the four item attributes we used for this case study.
The item designer, composition, and release season identifiers were translated to nominal categories. The price was kept in the original format and binned using the Weka toolkit. Items lacking complete data on any of the attributes were not included in the final dataset due to the problem of incomplete data.
Purchase Transaction Attributes
Table 2.5 presents the two transaction attributes we used for this case study.
The transaction date field was provided in one valid format and presented no parsing problems. The date of a
transaction was codified into a binary representation of the calendar season(s) according to the scheme shown in
Table 2.6. This codification scheme results in the “distance” between January and April being equivalent to the
“distance” between September and December, which is intuitive.
The price of each item was kept in the original decimal format and binned using the Weka toolkit. We chose not to
remove discounted items from the dataset. Items with no corresponding user were encountered when the user had
been removed from the dataset due to an aspect of the user demographic attribute causing a problem. An overview
of the number of item transactions removed from the loyal dataset based on the processing and codification step
can be seen in Table 2.7.
As a result of performing these data processing and cleaning steps, we are left with a dataset we refer to as Loyal-
Clean. An overview of the All, Loyal, and the processed and codified dataset, Loyal-Clean, is shown in Table 2.8.
(Source: Cantador, I., Elliott, D. and Jose, J. M., A Case Study of Exploiting Data Mining Techniques for an Industrial Recommender System [PDF] Available at: <http://ir.ii.uam.es/publications/indrec09.pdf>. [Accessed 30 September 2011].)
Questions
1. Explain the dataset used in this case study.
2. How were demographic attributes used in the above case study?
3. Write a note on the purchase transaction attributes used in this case study.
Case Study III
ABSTRACT
Data Mining is gaining popularity as an effective tool for increasing profits in a variety of industries. However,
the quality of the information resulting from the data mining exercise is only as good as the underlying data. The
importance of accurate, accessible data is paramount. A well designed data warehouse can greatly enhance the
effectiveness of the data mining process. This paper will discuss the planning and development of a data warehouse
for a credit card bank. While the discussion covers a number of aspects and uses of the data warehouse, a particular
focus will be on the critical needs for data access pertaining to targeting model development.
The case study will involve developing a Lifetime Value model from a variety of data sources including account
history, customer transactions, offer history and demographics. The paper will discuss the importance of some
aspects of the physical design and maintenance to the data mining process.
INTRODUCTION
One of the most critical steps in any data mining project is obtaining good data. Good data can mean many things:
clean, accurate, predictive, timely, accessible and/or actionable. This is especially true in the development of targeting
models. Targeting models are only as good as the data on which they are developed. Since the models are used to
select names for promotions, they can have a significant financial impact on a company’s bottom line.
The overall objectives of the data warehouse are to assist the bank in developing a totally data driven approach to
marketing, risk and customer relationship management. This would provide opportunities for targeted marketing
programs. The analysis capabilities would include:
• Response Modelling and Analysis
• Risk or Approval Modelling and Analysis
• Activation or Usage Modelling and Analysis
• Lifetime Value or Net Present Value Modelling
• Segmentation and Profiling
• Fraud Detection and Analysis
• List and Data Source Analysis
• Sales Management
• Customer Touchpoint Analysis
• Total Customer Profitability Analysis
The case study objectives focus on the development of a targeting model using information and tools available
through the data warehouse. Anyone who has worked with target model development knows that data extraction
and preparation are often the most time-consuming part of model development. Ask a group of analysts how much of their time is spent preparing data; a majority of them will say over 50%.
[Figure: Approximate share of project time (per cent) spent on business objectives development, data preparation, data mining, and analysis of results and knowledge accumulation.]
Over the last 10 years, the bank had amassed huge amounts of information about our customers and prospects. The
analysts and modellers knew there was a great amount of untapped value in the data. They just had to figure out a
way to gain access to it. The goal was to design a warehouse that could bring together data from disparate sources
into one central repository.
THE TABLES
The first challenge was to determine which tables should go into the data warehouse. We had a number of issues:
• Capturing response information
• Storing transactions
• Defining date fields
Response information
Responses begin to arrive about a week after an offer is mailed. Upon arrival, the response is put through a risk
screening process. During this time, the prospect is considered ‘Pending.’ Once the risk screening process is
complete, the prospect is either ‘Approved’ or ‘Declined.’ The bank considered two different options for storing
the information in the data warehouse.
• The first option was to store the data in one large table. The table would contain information about those approved
as well as those declined. Traditionally, across all applications, they saw approval rates hover around 50%.
Therefore, whenever analysis was done on either the approved applications (with a risk management focus) or on the declined population (with a marketing as well as risk management focus), every query needed to go through nearly double the number of records necessary.
• The second option was to store the data in three small tables. This accommodated the daily updates and allowed
for pending accounts to stay separate as they awaited information from either the applicant or another data
source.
With applications coming from e-commerce sources, the importance of the “pending” table increased. This table
was examined daily to determine which pending accounts could be approved quickly with the least amount of risk.
In today’s competitive market, quick decisions are becoming a competitive edge. Partitioning the large customer
profile table into three separate tables improved the speed of access for each of the three groups of marketing analysts
who had responsibility for customer management, reactivation and retention, and activation. The latter group was
responsible for both the one-time buyers and the prospect pools.
FILE STRUCTURE ISSUES
Many of the tables presented design challenges. Structural features that provided ease of use for analysts could
complicate the data loading process for the IT staff. This was a particular problem when it came to transaction data.
This data is received on a monthly basis and consists of a string of transactions for each account for the month.
This includes transactions such as balances, purchases, returns and fees. In order to make use of the information at
a customer level it needs to be summarised. The question was how to best organize the monthly performance data
in the data warehouse. Two choices were considered:
Long skinny file: this took the data into the warehouse in much the same form as it arrived.
Each month would enter the table as a separate record. Each year has a separate table. The fields represent the
following:
Wide file: this design has a single row per customer. It is much more tedious to update. But in its final form, it is
much easier to analyze because the data has already been organized into a single customer record. Each year has
a separate table. The layout is as follows:
The final decision was to go with the wide file or the single row per customer design. The argument was that the
manipulation to the customer level file could be automated thus making the best use of the analyst’s time.
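A minimal, hypothetical illustration of the two layouts (Python with pandas; the field names are made up): the long file carries one row per customer per month, and pivoting it produces the wide, single-row-per-customer form that was chosen.

import pandas as pd

# Long ("skinny") layout: one row per customer per month, much as the data arrives.
long_file = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C2"],
    "month": ["199901", "199902", "199901", "199902"],
    "balance": [520.0, 610.0, 0.0, 150.0],
})

# Wide layout: a single row per customer, one balance column per month of the year.
wide_file = long_file.pivot(index="customer_id", columns="month", values="balance")
wide_file.columns = [f"balance_{m}" for m in wide_file.columns]
print(wide_file.reset_index())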
DATE ISSUES
Many analyses are performed using date values. In our previous situation, we saw how transactions are received and
updated on a monthly basis. This is useful when comparing values of the same vintage. However, another analyst
might need to compare balances at a certain stage in the customer lifecycle. For example, to track customer balance
cycles from multiple campaigns a field that denotes the load date is needed.
The first type of analysis was tracking monthly activity by the vintage acquisition campaign. For example, calculating
monthly trends of balances aggregated separately for those accounts booked in May 99 and September 99. This
required aggregating the data for each campaign by the “load date” which corresponded to the month in which the
transaction occurred.
The second analyses focused on determining and evaluating trends in the customer life cycle. Typically, customers
who took a balance transfer at the time of acquisition showed balance run-off shortly after the introductory teaser
APR rate expired and the account was repriced to a higher rate. These are the dreaded “rate surfers.” Conversely, a
significant number of customers, who did not take a balance transfer at the time of acquisition, demonstrated balance
build. Over time these customers continued to have higher than average monthly balances. Some demonstrated
revolving behaviour: paying less than the full balance each month and a willingness to pay interest on the revolving
balance. The remainder in this group simply used their credit cards for convenience. Even though they built balances
through debit activity each month, they chose to pay their balances in full and avoid finance charges. These are the
“transactors” or convenience users. The second type of analysis needed to use ‘Months on books’, regardless of
the source campaign. This analysis required computation of the account age by looking at both the date the account
was opened as well as the “load date” of the transaction data. However, if the data mining task is also to understand this behaviour in the context of the campaign vintage mentioned earlier, there is another consideration.
Prospects for the “May 99” campaign were solicited in May of 1999. However, many new customers did not use
their card until June or July of 1999. There were three main reasons:
• some wanted to compare their offer to other offers;
• processing is slower during May and June; and
• some waited until a specific event (e.g. purchase of a large present at the Christmas holidays) to use their card
for the first time.
At this point the data warehouse probably needs to store at least the following date information:
• Date of the campaign
• Date the account first opened
• Date of the first transaction
• Load date for each month of data
The difference between the load date for each month and either the date the account was first opened or the date of the first transaction can be used as the measure to index account age, or months on books.
No single date field is more important than another, but multiple date fields are probably necessary if vintage as well as customer life-cycle analyses are both to be performed.
Customer ID – a unique numeric or alpha-numeric code that identifies the customer throughout his entire lifecycle.
This element is especially critical in the credit card industry where the credit card number may change in the event
of a lost or stolen card. But it is essential in any table to effectively link and track the behaviour of and actions taken
on an individual customer.
Household ID – a unique numeric or alpha-numeric code that identifies the household of the customer through his
or her entire lifecycle. This identifier is useful in some industries where products or services are shared by more
than one member of a household.
Account number – a unique numeric or alpha-numeric code that relates to a particular product or service. One
customer can have several account numbers.
Customer name – the name of a person or a business. It is usually broken down into multiple fields: last name, first
name, middle name or initial, salutation.
Address – the street address is typically broken into components such as number, street, suite or apartment number,
city, state, zip+4. Some customer tables have a line for a P.O. Box. With population mobility about 10% per year,
additional fields that contain former addresses are useful for tracking and matching customers to other files.
Phone number – current and former numbers for home and work.
Demographics – characteristics such as gender, age, income, etc. may be stored for profiling and modelling.
Products or services – the list of products and product identification numbers varies by company. An insurance
company may list all the policies along with policy numbers. A bank may list all the products across different
divisions of the bank including checking, savings, credit cards, investments, loans, and more. If the number of
products and product detail is extensive, this information may be stored in a separate table with a customer and
household identifier.
Offer detail – the date, type of offer, creative, source code, pricing, distribution channel (mail, telemarketing, sales
rep, e-mail), and any other details of an offer. Most companies look for opportunities to cross-sell or up-sell their
current customers. There could be numerous “offer detail” fields in a customer record, each representing an offer
for an additional product or service.
Model Scores – response, risk, attrition, profitability scores and/or any other scores that are created or purchased.
Transaction table
The Transaction Table contains records of customer activity. It is the richest and most predictive information but
can be the most difficult to access. Each record represents a single transaction. So there are multiple records for
each customer. In order to use this data for modelling, it must be summarized and aggregated to a customer level.
The following lists key elements of the Transaction Table:
Customer ID – defined above.
Household ID – defined above.
Transaction Type – The type of credit card transaction such as charge, return, or fee (annual, overlimit, late).
Transaction Date– The date of the transaction
Transaction Amount – The dollar amount of the transaction.
Offer history table
With an average amount of solicitation activity, this type of table can become very large. It is important to perform analysis to establish business rules that control the maintenance of this table. Fields like ‘date of first offer’ are usually correlated with response behaviour. The following list details some key elements in an Offer History Table:
Prospect ID/Customer ID – as in the Customer Information Table, this is a unique numeric or alphanumeric code
that identifies the prospect for a specific length of time. This element is especially critical in the credit card industry
where the credit card number may change in the event of a lost or stolen card. But it is essential in any table to
effectively track the behaviour of and actions taken on an individual customer.
Household ID – a unique numeric or alpha-numeric code that identifies the household of the customer through his
entire lifecycle. This identifier is useful in some industries where products or services are shared by more than one
member of a household.
Prospect name* – the name of a person or a business. It is usually broken down into multiple fields: last name, first
name, middle name or initial, salutation.
Address* – the street address is typically broken into components such as number, street, suite or apartment number,
city, state, zip+4. As in the Customer Table, some prospect tables have a line for a P.O. Box. Additional fields that
contain former addresses are useful for matching prospects to outside files.
Phone number – current and former numbers for home and work.
Offer Detail – includes the date, type of offer, creative, source code, pricing, distribution channel (mail, telemarketing,
sales rep, email) and any other details of the offer. There could be numerous groups of “offer detail” fields in a
prospect or customer record, each representing an offer for an additional product or service.
Offer summary – date of first offer (for each offer type), best offer (unique to product or service), etc.
Model scores* – response, risk, attrition, profitability scores and/or any other scores that are created or purchased.
Predictive data* – includes any demographic, psychographic or behavioural data.
*These elements appear only on a Prospect Offer History Table; the Customer Table would support the Customer Offer History Table with additional data.
To predict Lifetime Value, data was pulled from the Offer History Table from three campaigns with a total of 966,856
offers. To reduce the amount of data for analysis and maintain the most powerful information, a sample is created
using all of the ‘Activation’ and 1/25th of the remaining records. This includes non-responders and non-activating
responders. We define an ACTIVE as a customer with a balance at three months. The following code creates the
sample dataset:
DATA A B;
  SET LIB.DATA;
  IF MON3_BAL > 0 THEN OUTPUT A;   /* actives: balance at three months */
  ELSE OUTPUT B;                   /* non-responders and non-activating responders */
RUN;

DATA LIB.SAMPDATA;
  SET A B (WHERE=(RANUNI(5555) < .04));   /* all actives plus roughly 1/25 of the rest */
  SAMP_WGT = 25;                          /* weight variable with a value of 25 */
RUN;
This code puts into the sample dataset all customers who activated and a 1/25th random sample of the remaining accounts. It also creates a weight variable called SAMP_WGT with a value of 25.
The non-responders and non-activated responders are grouped together since our target is active responders. This
gives us a manageable sample size of 74,944.
Model development
The first component of the LTV, the probability of activation, is based on a binary outcome, which is easily modelled
using logistic regression. Logistic regression uses continuous values to predict the odds of an event happening. The
log of the odds is a linear function of the predictors. The equation is similar to the one used in linear regression with
the exception that a log transformation is applied to the dependent variable (the log odds). The equation is as follows:
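The standard form of the logistic model, consistent with the description above (the source's own equation image is not reproduced here), is:

log(p / (1 - p)) = b0 + b1x1 + b2x2 + ... + bkxk

where p is the probability of activation, the xi are the predictors, and the bi are the parameters estimated from the data.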
Through analysis, the following variables were determined to be the most predictive.
SAM_OFF1 – received the same offer one time in the past 6 months.
DIF_OFF1 – received a different offer one time in the past 6 months.
SAM_OFF2 – received the same offer more than one time in the past 6 months.
DIF_OFF2 – received a different offer more than one time in the past 6 months.
The product being modelled is Product 2. The following code creates the variables for modelling:
SAM_OFF1 = (NPROD2 = 1);                       /* same offer received exactly once */
SAM_OFF2 = (NPROD2 > 1);                       /* same offer received more than once */
DIF_OFF1 = (SUM(NPROD1, NPROD3, NPROD4) = 1);  /* a different offer received exactly once */
DIF_OFF2 = (SUM(NPROD1, NPROD3, NPROD4) > 1);  /* a different offer received more than once */
If the prospect has never received an offer, then the values for the four named variables will all be 0.
The logistic model output (see Appendix D) shows two forms of TOT_BAL to be significant in combination:
TOT_BAL TOT_B_SQ
These forms will be introduced into the final model.
Partition data
The data are partitioned into two datasets, one for model development, and one for validation. This is accomplished
by randomly splitting the data in half using the following SAS® code:
DATA LIB.MODEL LIB.VALID;
SET LIB.DATA;
IF RANUNI(0) < .5 THEN OUTPUT LIB.MODEL;
ELSE OUTPUT LIB.VALID;
RUN;
If the model performs well on the model data and not as well on the validation data, the model may be over-fitting the
data. This happens when the model memorizes the data and fits the models to unique characteristics of that particular
data. A good, robust model will score with comparable performance on both the model and validation datasets. As a
result of the variable preparation, a set of ‘candidate’ variables has been selected for the final model. The next step
is to choose the model options. The backward selection process is favoured by some modellers because it evaluates
all of the variables in relation to the dependent variable while considering interactions among the independent or
predictor variables. It begins by measuring the significance of all the variables and then removing one at a time until
only the significant variables remain.
The sample weight must be included in the model code to recreate the original population dynamics. If you eliminate
the weight, the model will still produce correct ranking-ordering but the actual estimates for the probability of a
‘paid-sale’ will be incorrect. Since our LTV model uses actual estimates, we will include the weights. The following
code is used to build the final model.
The resulting model has 7 predictors. Each parameter estimate is multiplied by the value of its variable to create the final probability. The strength of the predictive power is distributed like a chi-square, so we look to that distribution for significance: the higher the chi-square, the lower the probability that the effect occurred randomly (Pr > chi-square). The strongest predictor is the variable DIF_OFF2, which demonstrates the power of offer history on the behaviour of a prospect. Introducing offer history variables into the acquisition modelling process has been the single most significant improvement in the last three years. The following equation shows how the probability is
calculated, once the parameter estimates have been calculated:
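A standard way of writing this (the source's own equation image is not reproduced here) is:

p = 1 / (1 + exp(-(b0 + b1x1 + b2x2 + ... + bkxk)))

so that each prospect's predictor values and the estimated parameters together yield a probability of activation between 0 and 1.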
This creates the final score, which can be evaluated using a gains table (see Appendix D). Sorting the dataset by
the score and dividing it into 10 groups of equal volume creates the gains table. This is called a Decile Analysis.
The validation dataset is also scored and evaluated in a gains table or Decile Analysis. Both of these tables show
strong rank ordering. This can be seen by the gradual decrease in predicted and actual probability of ‘Activation’
from the top decile to the bottom decile. The validation data shows similar results, which indicates a robust model.
To get a sense of the ‘lift’ created by the model, a gains chart is a powerful visual tool. The Y-axis represents the
% of ‘Activation’ captured by each model. The X-axis represents the % of the total population mailed. Without the
model, if you mail 50% of the file, you get 50% of the potential ‘Activation’. If you use the model and mail the
same percentage, you capture over 97% of the ‘Activation’. This means that at 50% of the file, the model provides
a ‘lift’ of 94% {(97-50)/50}.
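For readers who want to reproduce a decile analysis on their own scored file, here is a minimal sketch (hypothetical data and column names; Python with pandas and NumPy assumed): sort by model score, cut the file into ten equal-volume groups, and report activations and cumulative capture per decile.

import numpy as np
import pandas as pd

# Hypothetical scored file: one row per prospect, with a model score and actual outcome.
rng = np.random.default_rng(1)
scored = pd.DataFrame({"score": rng.random(10_000)})
scored["activated"] = (rng.random(10_000) < scored["score"] * 0.2).astype(int)

# Decile analysis: sort by score and split into 10 equal-volume groups (decile 1 = best).
scored = scored.sort_values("score", ascending=False).reset_index(drop=True)
scored["decile"] = pd.qcut(scored.index, 10, labels=list(range(1, 11)))

gains = scored.groupby("decile", observed=True).agg(
    prospects=("activated", "size"),
    activations=("activated", "sum"),
)
gains["cum_pct_of_activations"] = (
    gains["activations"].cumsum() / gains["activations"].sum() * 100
)
print(gains)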
Financial assessment
To get the final LTV we use the formula:
LTV = Pr(Paid Sale) × Risk Index Score × Expected Account Profit − Marketing Expense
At this point, we apply the risk matrix score and the expected account profit value. The financial assessment shows
the model’s ability to select the most profitable customers. Notice how the risk index score is lower for the most
responsive customers. This is common in direct response and demonstrates ‘adverse selection’; in other words, the
riskier prospects are often the most responsive. At some point in the process, a decision is made to mail a percentage
of the file. In this case, you could note that the LTV becomes negative in decile 7 and limit your selection to
deciles 1 through 6. Another decision criterion could be that you need to be above a certain ‘hurdle rate’ to cover
fixed expenses; in this case, you might require the cumulative LTV to be above a certain amount, such as $30.
Decisions are often made considering a combination of criteria.
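Applying the formula is a straightforward per-decile calculation. The sketch below is illustrative only and assumes
a decile-level summary dataset LIB.DECILE_SUM with hypothetical variables P_SALE (probability of a paid sale),
RISK_IDX, EXP_PROFIT and MAIL_COST; none of these names come from the case study:
/* Compute LTV per decile and flag deciles that clear a $30 hurdle rate */
DATA LIB.LTV;
   SET LIB.DECILE_SUM;
   LTV = P_SALE * RISK_IDX * EXP_PROFIT - MAIL_COST;
   MEETS_HURDLE = (LTV >= 30);   /* 1 if the decile clears the hurdle rate, else 0 */
RUN;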
The final evaluation of your efforts may be measured in a couple of ways. One goal could be to mail fewer pieces
and capture the same LTV. If we mail the entire file with random selection, we capture $13,915,946 in LTV at a mail
cost of $754,155. By mailing 5 deciles using the model, we would capture $14,042,255 in LTV at a mail cost of only
$377,074. In other words, with the model we could capture slightly more LTV and cut our marketing cost in half.
Alternatively, we can compare similar mail volumes and increase LTV. With random selection at 50% of the file, we
would capture $6,957,973 in LTV; modelled, the LTV climbs to $14,042,255. This is a lift of over 100%
((14,042,255 - 6,957,973)/6,957,973 = 1.018).
Conclusion
Successful data mining and predictive modelling depend on quality data that is easily accessible. A well-constructed
data warehouse allows for the integration of Offer History, which is an excellent predictor of Lifetime Value.
(Source: Rud, C. O., Data Warehousing for Data Mining: A Case Study [PDF] Available at: <http://www2.sas.com/
proceedings/sugi25/25/dw/25p119.pdf>. [Accessed 30 September 2011].)
Question
1. How many datasets is the data partitioned into?
2. Which is the first challenge mentioned in the above case study?
3. What are the analysis capabilities?
Bibliography
References
• Adriaans, P., 1996. Data Mining, Pearson Education India.
• Alexander, D., Data Mining [Online] Available at: <http://www.laits.utexas.edu/~norman/BUS.FOR/course.
mat/Alex/>. [Accessed 9 September 2011].
• Berlingerio, M., 2009. Temporal mining for interactive workflow data analysis [Video Online] Available at:
<http://videolectures.net/kdd09_berlingerio_tmiwda/>. [Accessed 12 September 2011].
• Kriegel, H. P., Spatial Data Mining [Online] Available at: <http://www.dbs.informatik.uni-muenchen.de/
Forschung/KDD/SpatialKDD/>. [Accessed 9 September 2011].
• Kuonen, D., 2009. Data Mining Applications in Pharma/BioPharma Product Development [Video Online]
Available at: <http://www.youtube.com/watch?v=kkRPW5wSwNc>. [Accessed 12 September 2011].
• Galeas, Web mining [Online] Available at: <http://www.galeas.de/webmining.html>. [Accessed 12
September 2011].
• Hadley, L., 2002. Developing a Data Warehouse Architecture [Online] Available at: <http://www.users.qwest.
net/~lauramh/resume/thorn.htm>. [Accessed 8 September 2011].
• Han, J. and Kamber, M., 2006. Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann.
• Han, J., Kamber, M. and Pei, J., 2011. Data Mining: Concepts and Techniques, 3rd ed., Elsevier.
• http://nptel.iitm.ac.in, 2008. Lecture - 34 Data Mining and Knowledge Discovery [Video Online] Available
at: <http://www.youtube.com/watch?v=m5c27rQtD2E>. [Accessed 12 September 2011].
• http://nptel.iitm.ac.in, 2008. Lecture - 35 Data Mining and Knowledge Discovery Part II [Video Online] Available
at: <http://www.youtube.com/watch?v=0hnqxIsXcy4&feature=relmfu>. [Accessed 12 September 2011].
• Humphries, M., Hawkins, M. W. and Dy, M. C., 1999. Data warehousing: architecture and implementation,
Prentice Hall Professional.
• Intricity101, 2011. What is OLAP? [Video Online] Available at: <http://www.youtube.com/watch?v=2ryG3Jy
6eIY&feature=related>. [Accessed 12 September 2011].
• Kimball, R., 2006. The Data Warehouse Lifecycle Toolkit, Wiley-India.
• Kumar, A., 2008. Data Warehouse Layered Architecture 1 [Video Online] Available at: <http://www.youtube.
com/watch?v=epNENgd40T4>. [Accessed 11 September 2011].
• Larose, D. T., 2006. Data mining methods and models, John Wiley and Sons.
• Learndatavault, 2009. Business Data Warehouse (BDW) [Video Online] Available at: <http://www.youtube.
com/watch?v=OjIqP9si1LA&feature=related>. [Accessed 12 September 2011].
• Lin, W., Orgun, M. A. and Williams, G. J., An Overview of Temporal Data Mining [Online PDF] Available at:
<http://togaware.redirectme.net/papers/adm02.pdf>. [Accessed 9 September 2011].
• Liu, B., 2007. Web data mining: exploring hyperlinks, contents, and usage data, Springer.
• Mailvaganam, H., 2007. Data Warehouse Project Management [Online] Available at: <http://www.dwreview.
com/Articles/Project_Management.html>. [Accessed 8 September 2011].
• Maimon, O. and Rokach, L., 2005. Data mining and knowledge discovery handbook, Springer Science and
Business.
• Maimon, O. and Rokach, L., Introduction to Knowledge Discovery in Databases [Online PDF] Available at:
<http://www.ise.bgu.ac.il/faculty/liorr/hbchap1.pdf>. [Accessed 9 September 2011].
• Mento, B. and Rapple, B., 2003. Data Mining and Warehousing [Online] Available at: <http://www.arl.org/
bm~doc/spec274webbook.pdf>. [Accessed 9 September 2011].
• Mitsa, T., 2009. Temporal Data Mining, Chapman & Hall/CRC.
• OracleVideo, 2010. Data Warehousing Best Practices Star Schemas [Video Online] Available at: <http://www.
youtube.com/watch?v=LfehTEyglrQ>. [Accessed 12 September 2011].
• Orli, R. and Santos, F., 1996. Data Extraction, Transformation, and Migration Tools [Online] Available at:
<http://www.kismeta.com/extract.html>. [Accessed 9 September 2011].
• Ponniah, P., 2001. Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals,
Wiley-Interscience Publication.
• SalientMgmtCompany, 2011. Salient Visual Data Mining [Video Online] Available at: <http://www.youtube.
com/watch?v=fosnA_vTU0g>. [Accessed 12 September 2011].
• Seifert, J. W., 2004. Data Mining: An Overview [Online PDF] Available at: <http://www.fas.org/irp/crs/RL31798.
pdf>. [Accessed 9 September 2011].
• Springerlink, 2006. Data Mining System Products and Research Prototypes [Online PDF] Available at: <http://
www.springerlink.com/content/2432076500506017/>. [Accessed 12 September 2011].
• SQLUSA, 2009. SQLUSA.com Data Warehouse and OLAP [Video Online] Available at: <http://www.youtube.
com/watch?v=OJb93PTHsHo>. [Accessed 12 September 2011].
• StatSoft, 2010. Data Mining, Cluster Techniques - Session 28 [Video Online] Available at: <http://www.youtube.
com/watch?v=WvR_0Vs1U8w>. [Accessed 12 September 2011].
• StatSoft, 2010. Data Mining, Model Deployment and Scoring - Session 30 [Video Online] Available at: <http://
www.youtube.com/watch?v=LDoQVbWpgKY>. [Accessed 12 September 2011].
• Statsoft, Data Mining Techniques [Online] Available at: <http://www.statsoft.com/textbook/data-mining-
techniques/>. [Accessed 12 September 2011].
• Swallacebithead, 2010. Using Data Mining Techniques to Improve Forecasting [Video Online] Available at:
<http://www.youtube.com/watch?v=UYkf3i6LT3Q>. [Accessed 12 September 2011].
• University of Magdeburg, 2007. 3D Spatial Data Mining on Document Sets [Video Online] Available at: <http://
www.youtube.com/watch?v=jJWl4Jm-yqI>. [Accessed 12 September 2011].
• Wan, D., 2007. Typical data warehouse deployment lifecycle [Online] Available at: <http://dylanwan.wordpress.
com/2007/11/02/typical-data-warehouse-deployment-lifecycle/>. [Accessed 12 September 2011].
• Zaptron, 1999. Introduction to Knowledge-based Knowledge Discovery [Online] Available at: <http://www.
zaptron.com/knowledge/>. [Accessed 9 September 2011].
Recommended Reading
• Chang, G., 2001. Mining the World Wide Web: an information search approach, Springer.
• Chattamvelli, R., 2011. Data Mining Algorithms, Alpha Science International Ltd.
• Han, J. and Kamber, M., 2006. Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann.
• Jarke, M., 2003. Fundamentals of data warehouses, 2nd ed., Springer.
• Kantardzic, M., 2001. Data Mining: Concepts, Models, Methods, and Algorithms, 2nd ed., Wiley-IEEE.
• Khan, A., 2003. Data Warehousing 101: Concepts and Implementation, iUniverse.
• Liu, B., 2011. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 2nd ed., Springer.
• Markov, Z. and Larose, D. T., 2007. Data mining the Web: uncovering patterns in Web content, structure, and
usage, Wiley-Interscience.
• Parida, R., 2006. Principles & Implementation of Data Warehousing, Firewall Media.
• Ponniah, P., 2001. Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, Wiley-
Interscience Publication.
• Ponniah, P., 2010. Data Warehousing Fundamentals for IT Professionals, 2nd ed., John Wiley and Sons.
• Prabhu, C. S. R., 2004. Data warehousing: concepts, techniques, products and applications, 2nd ed., PHI
Learning Pvt. Ltd.
• Pujari, A. K., 2001. Data mining techniques, 4th ed., Universities Press.
• Rainardi, V., 2007. Building a data warehouse with examples in SQL Server, Apress.
• Roddick, J. F. and Hornsby, K., 2001. Temporal, spatial, and spatio-temporal data mining, Springer.
• Scime, A., 2005. Web mining: applications and techniques, Idea Group Inc (IGI).
• Stein, A., Shi, W. and Bijker, W., 2008. Quality aspects in spatial data mining, CRC Press.
• Thuraisingham, B. M., 1999. Data mining: technologies, techniques, tools, and trends, CRC Press.
• Witten, I. H. and Frank, E., 2005. Data mining: practical machine learning tools and techniques, 2nd ed.,
Morgan Kaufmann.
Self Assessment Answers
Chapter II
1. b
2. a
3. d
4. a
5. b
6. b
7. c
8. c
9. a
10. a
Chapter III
1. a
2. c
3. b
4. d
5. a
6. d
7. b
8. b
9. c
10. a
Chapter IV
1. a
2. c
3. a
4. a
5. c
6. c
7. a
8. a
9. c
10. d
11. a
Chapter V
1. d
2. c
3. a
4. a
5. a
6. a
7. b
8. b
9. d
10. a
Chapter VI
1. c
2. a
3. b
4. d
5. c
6. a
7. b
8. b
9. b
10. a
Chapter VII
1. a
2. c
3. d
4. a
5. b
6. d
7. c
8. c
9. a
10. b