DataWarehouseMining Complete Notes
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
Data characterization summarizes the data of the class under study (often called the target class).
For Example:
A data mining system should be able to produce a description summarizing the characteristics of customers who
spend more than $1,000 a year at AllElectronics. The result could be a profile of the customers, such as: they
are 40-50 years old, employed, and have excellent credit ratings.
Data discrimination, by comparison of target class with one or a set of comparative classes (often called the
contrasting classes)
For Example:
A data mining system should be able to compare two groups of AllElectronics customers, such as those who
shop for computer products regularly versus those who rarely shop for such products. The resulting description
provides a general comparative profile of the customers, such as 80% of the customers who frequently purchase
computer products are between 20 and 40 years old and have a university education, whereas 60% of the
customers who infrequently buy such products are either seniors or youths, and have no university degree.
Classification and Prediction
Example:
Suppose, as sales manager of AllElectronics, you would like to classify a large set of items in the store, based
on three kinds of responses to a sales campaign: good response, mild response, and no response. You would like
to derive a model for each of these three classes based on the descriptive features of the items, such as price,
brand, place_made, type, and category.
The resulting classification should maximally distinguish each class from others, presenting an organized
picture of the data set.
Cluster Analysis
The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and
minimizing the interclass similarity.
The clusters of objects are formed so that objects within a cluster have high similarity to one another, but are
very dissimilar to objects in other clusters.
Example:
Cluster analysis can be performed on AllElectronics customer data in order to identify homogeneous
subpopulations of customers. These clusters may represent individual target groups for marketing.
Outlier Analysis
A database may contain data objects that do not comply with the general behavior or model of the data. These
data objects are called outliers.
Most data mining methods discard outliers as noise or exceptions.
In some applications such as fraud detection, the rare events can be more interesting than the more regularly
occurring ones.
Example:
Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of extremely large
amounts for a given account number in comparison to regular charges incurred by the same account.
Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior changes
over time.
This may include characterization, discrimination, association and correlation analysis,
classification, prediction, or clustering of time-related data.
Distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern
matching, and similarity-based data analysis.
Example:
Suppose that you have the major stock market (time-series) data of the last several years available from the New
York Stock Exchange and you would like to invest in shares of high-tech industrial companies.
A data mining study of stock exchange data may identify stock evolution regularities for overall stocks and for
the stocks of particular companies. Such regularities may help predict future trends in stock market prices,
contributing to your decision making regarding stock investments.
The Stages of KDD (Knowledge Discovery in Database)
The KDD process as follows:
1. Data Cleaning
2. Data Integration
3. Data Selection
4. Data Transformation
5. Data Mining
6. Pattern Evaluation
7. Knowledge Presentation
Data Cleaning:
To remove noise and inconsistent data.
Data Integration:
Where multiple data sources may be combined
Data Selection:
Where data relevant to the analysis task are retrieved from the database.
Data Transformation:
Where data are transformed or consolidated into forms appropriate for mining by performing summary or
aggregation operation, for instance.
Data Mining:
An essential process where intelligent methods are applied in order to extract data patterns.
(We agree that data mining is a step in the knowledge discovery process)
Pattern Evaluation:
To identify the truly interesting patterns representing knowledge based on some interestingness measures
Knowledge Presentation:
Where visualization and knowledge representation techniques are used to present the mined knowledge to the
user.
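The stages above can be pictured as a simple pipeline. The following Python sketch is only illustrative; every function and field name in it is a hypothetical placeholder standing in for a real cleaning, integration, selection, transformation, or mining step.

# Minimal sketch of the KDD pipeline; all helpers here are hypothetical placeholders.

def clean(records):
    # stage 1: remove noisy or inconsistent records
    return [r for r in records if r.get("amount") is not None]

def integrate(*sources):
    # stage 2: combine multiple data sources
    merged = []
    for src in sources:
        merged.extend(src)
    return merged

def select(records, year):
    # stage 3: retrieve only the data relevant to the analysis task
    return [r for r in records if r["year"] == year]

def transform(records):
    # stage 4: consolidate into a summary form suitable for mining
    totals = {}
    for r in records:
        totals[r["city"]] = totals.get(r["city"], 0) + r["amount"]
    return totals

def mine(summary):
    # stage 5: apply an "intelligent method" -- here, trivially, find the top city
    return max(summary, key=summary.get)

source_a = [{"city": "Kathmandu", "year": 2012, "amount": 120.0}]
source_b = [{"city": "Pokhara", "year": 2012, "amount": 80.0},
            {"city": "Pokhara", "year": 2012, "amount": None}]

data = select(clean(integrate(source_a, source_b)), year=2012)
print(mine(transform(data)))   # pattern evaluation and presentation would follow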
The Major Components of a Typical Data Mining System
Database, Data Warehouse, World Wide Web, or other information repository
Database or Data Warehouse Server
Knowledge Base
Data Mining Engine
Pattern Evaluation Module
User Interface
Data mining is
• A cybernetic magic that will turn your data into gold. It’s the process and result of knowledge
production, knowledge discovery and knowledge management.
• Once the patterns are found, the data mining process is finished.
• Queries to the database are not DM.
What is Data Warehouse?
• According to W. H. Inmon, a data warehouse is a subject-oriented, integrated, time-variant,
nonvolatile collection of data in support of management decisions.
• “A data warehouse is a copy of transaction data specifically structured for querying and reporting” –
Ralph Kimball
• Data Warehousing is the process of building a data warehouse for an organization.
• Data Warehousing is a process of transforming data into information and making it available to users in
a timely enough manner to make a difference
Subject Oriented
• Data in the warehouse is organized around the major subjects of the enterprise (such as customer, product,
and sales) rather than around the applications or transactions that produce it.
Non Volatile
• Data Warehouse is relatively Static in nature.
• It is not updated in real time; data in the data warehouse is loaded and refreshed from operational
systems and is not updated by end users.
Data warehousing helps business managers to :
– Extract data from various source systems on different platforms
– Transform huge data volumes into meaningful information
– Analyze integrated data across multiple business dimensions
– Provide access to the analyzed information for business users anytime, anywhere
OLTP vs. Data Warehouse
• Online Transaction Processing (OLTP) systems are tuned for known transactions and workloads while
workload is not known a priori in a data warehouse
• OLTP applications normally automate clerical data processing tasks of an organization, like data entry
and enquiry, transaction handling, etc. (access, read, update)
• Special data organization, access methods and implementation methods are needed to support data
warehouse queries (typically multidimensional queries)
– e.g., average amount spent on phone calls between 9AM-5PM in Kathmandu during the
month of March, 2012
• OLTP vs. Data Warehouse
– Application oriented vs. subject oriented
– Used to run the business vs. used to analyze the business
– Detailed data vs. summarized and refined data
– Current, up-to-date data vs. snapshot data
– Isolated data vs. integrated data
– Repetitive access vs. ad hoc access
– Clerical user vs. knowledge user (manager)
– Database size 100 MB - 100 GB vs. 100 GB - a few terabytes
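For instance, the phone-call query mentioned above could be answered over an integrated call-record table. The Python/pandas sketch below is purely illustrative; the table and its column names are hypothetical.

import pandas as pd

# Hypothetical call-record data; in a real warehouse this would come from the
# integrated, subject-oriented store, not from an OLTP system.
calls = pd.DataFrame({
    "city":   ["Kathmandu", "Kathmandu", "Pokhara", "Kathmandu"],
    "month":  ["2012-03",   "2012-03",   "2012-03", "2012-04"],
    "hour":   [10, 16, 11, 14],
    "amount": [50.0, 70.0, 40.0, 65.0],
})

# Average amount spent on calls between 9AM and 5PM in Kathmandu during March 2012
mask = (
    (calls["city"] == "Kathmandu")
    & (calls["month"] == "2012-03")
    & calls["hour"].between(9, 17)
)
print(calls.loc[mask, "amount"].mean())   # 60.0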
Risk analysis and management
Where does the data come from?—Credit card transactions, loyalty cards, discount coupons, customer
complaint calls, plus (public) lifestyle studies
Target marketing
◦ Find clusters of “model” customers who share the same characteristics: interest, income level,
spending habits, etc.
◦ Determine customer purchasing patterns over time
Cross-market analysis—Find associations/co-relations between product sales, & predict based on such
association
Customer profiling—What types of customers buy what products (clustering or classification)
Customer requirement analysis
◦ Identify the best products for different groups of customers
◦ Predict what factors will attract new customers
Provision of summary information
◦ Multidimensional summary reports
◦ Statistical summary information (data central tendency and variation)
Corporate Analysis & Risk Management
Finance planning and asset evaluation
◦ cash flow analysis and prediction
◦ contingent claim analysis to evaluate assets
◦ cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning
◦ summarize and compare the resources and spending
Competition
◦ monitor competitors and market directions
◦ group customers into classes and develop class-based pricing procedures
◦ set pricing strategy in a highly competitive market
• Data selection
• Cleaning
• Enrichment
• Coding
• Data Mining
• Reporting
Figure: Knowledge Discovery in Databases (KDD) Process
Data Selection
Once you have formulated your informational requirements, the next logical step is to collect and select the data
you need. Setting up a KDD activity is also a long-term investment. The data environment will need to be
refreshed from operational data on a regular basis; therefore, investing in a data warehouse is an important
aspect of the whole process.
Cleaning
Almost all databases in large organizations are polluted and when we start to look at the data from a data mining
perspective, ideas concerning consistency of data change. Therefore, before we start the data mining process, we
have to clean up the data as much as possible, and this can be done automatically in many cases.
Figure: De-duplication
Enrichment
Matching the information from bought-in databases with your own databases can be difficult. A well-known
problem is the reconstruction of family relationships in databases. In a relational environment, we can simply
join this information with our original data.
Figure: Enrichment
Figure: Enriched Table
Coding
We can apply the following coding techniques:
(1) Address to regions
(2) Birthdate to age
(3) Divide income by 1000
(4) Divide credit by 1000
(5) Convert cars yes/no to 1/0
(6) Convert purchased date to months numbers
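A small Python/pandas sketch of these six coding steps, applied to a hypothetical customer table (all column names and values are made up):

import pandas as pd

customers = pd.DataFrame({
    "address":   ["Kathmandu", "Pokhara"],
    "birthdate": pd.to_datetime(["1980-05-01", "1995-11-20"]),
    "income":    [42000, 15500],
    "credit":    [8000, 2500],
    "car":       ["yes", "no"],
    "purchased": pd.to_datetime(["2012-03-10", "2012-01-05"]),
})

region_map = {"Kathmandu": "Central", "Pokhara": "Western"}   # (1) address -> region
today = pd.Timestamp("2012-12-31")

coded = pd.DataFrame({
    "region":       customers["address"].map(region_map),                      # (1)
    "age":          (today - customers["birthdate"]).dt.days // 365,           # (2)
    "income_k":     customers["income"] / 1000,                                # (3)
    "credit_k":     customers["credit"] / 1000,                                # (4)
    "has_car":      (customers["car"] == "yes").astype(int),                   # (5)
    "months_since": (today.year - customers["purchased"].dt.year) * 12
                    + (today.month - customers["purchased"].dt.month),         # (6)
})
print(coded)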
Data Mining
It is a discovery stage in KDD process.
• Data mining refers to extracting or “mining” knowledge from large amounts of data.
• Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery
from Database, or KDD.
• Alternatively, others view data mining as simply an essential step in the process of knowledge
discovery.
Figure: Averages
Figure: Age distribution of readers
Reporting
• It uses two functions:
1. Analysis of the results
2. Application of results
• Visualization and knowledge representation techniques are used to present the mined knowledge to the
user.
Figure: Data mining as a confluence of multiple disciplines – database systems, statistics, machine learning,
algorithms, visualization, and other disciplines.
Data Warehouse Architecture
Query Manager: This component performs all the operations necessary to manage user queries, such as directing
queries to the appropriate tables and scheduling the execution of queries.
• In some cases, the query manager also generates query profiles to allow the warehouse manager to
determine which indexes and aggregations are appropriate.
Meta Data: This area of the DW stores all the meta-data (data about data) definitions used by all the processes
in the warehouse.
• Used for a variety of purposes:
– Extraction and loading processes
– Warehouse management process
– Query management process
• End-user access tools use meta-data to understand how to build a query.
• Most vendor tools for copy management and end-user data access use their own versions of meta-data.
Lightly and Highly Summarized Data: It stores all the pre-defined lightly and highly aggregated data
generated by the warehouse manager.
• The purpose of summary info is to speed up the performance of queries.
• Removes the requirement to continually perform summary operations (such as sort or group by) in
answering user queries.
Archive/Backup Data: It stores detailed and summarized data for the purposes of archiving and backup.
• May be necessary to backup online summary data if this data is kept beyond the retention period for
detailed data.
• The data is transferred to storage archives such as magnetic tape or optical disk.
Applications of Data Mining
• Data mining is an interdisciplinary field with wide and diverse applications
– There exist nontrivial gaps between data mining principles and domain-specific applications
• Some application domains
– Financial data analysis
– Retail industry
– Telecommunication industry
– Biological data analysis
• Fraudulent pattern analysis and the identification of unusual patterns
– Identify potentially fraudulent users and their typical usage patterns
– Detect attempts to gain fraudulent entry to customer accounts
– Discover unusual patterns which may need special attention
• Multidimensional association and sequential pattern analysis
– Find usage patterns for a set of communication services by customer group, by month, etc.
– Promote the sales of specific services
– Improve the availability of particular services in a region
• Use of visualization tools in telecommunication data analysis
• The largest challenge a data miner may face is the sheer volume of data in the data warehouse.
• It is quite important, then, that summary data also be available to get the analysis started.
• A major problem is that this sheer volume may mask the important relationships the data miner is
interested in.
• The ability to overcome the volume and be able to interpret the data is quite important.
Warehouse Products
• Computer Associates -- CA-Ingres
• Hewlett-Packard -- Allbase/SQL
• Informix -- Informix, Informix XPS
• Microsoft -- SQL Server
• Oracle -- Oracle7, Oracle Parallel Server
• Red Brick -- Red Brick Warehouse
• SAS Institute -- SAS
• Software AG -- ADABAS
• Sybase -- SQL Server, IQ, MPP
Unit Two
DBMS Vs Data Warehouse
DBMS is the whole system used for managing digital databases, which allows storage of database content,
creation/maintenance of data, search and other functionalities.
A data warehouse, on the other hand, is a place that stores data for archival, analysis, and security purposes. A
data warehouse is made up of a single computer or several computers connected together to form a computer system.
A DBMS, sometimes just called a database manager, is a collection of computer programs dedicated to the
management (i.e., organization, storage, and retrieval) of all databases that are installed in the system (i.e., hard
drive or network).
Data warehouses play a major role in Decision Support Systems (DSS). DSS is a technique used by
organizations to develop and identify facts, trends or relationships that would help them to make better decisions
to achieve their organizational goals.
The key difference between DBMS and data warehouse is the fact that a data warehouse can be treated as a type
of a database or a special kind of database, which provides special facilities for analysis, and reporting while,
DBMS is the overall system which manages a certain database.
Data warehouses mainly store data for the purpose of reporting and analysis that would help an organization in
the process of making decisions, while a DBMS is a computer application that is used to organize, store, and
retrieve data. A data warehouse needs to use a DBMS to make data organization and retrieval more efficient.
Data Mart
Definition
A data mart is a simple form of data warehouse that is focused on a single subject ( or functional area)
such as Sales, Finance, or Marketing.
Data marts are often built and controlled by a single department within an organization.
Given their single-subject focus, data marts usually draw data from only a few sources.
The sources could be internal operational systems, a central data warehouse, or external data.
There are two basic types of data marts:
Dependent
Independent
Dependent Data Mart:
The dependent data marts draw data from a central data warehouse that has already been created.
Independent Data Mart:
Independent data marts, in contrast, are standalone systems built by drawing data directly from operational or external sources of data, or
both.
Metadata
Database that describes various aspects of data in the warehouse
Administrative Metadata: Source database and contents, Transformations required, History of Migrated
data
End User Metadata:
Definition of warehouse data
Descriptions of it
Consolidation Hierarchy
Uses of Metadata
Map source system data to data warehouse tables
Generate data extract, transform, and load procedures for import jobs
Help users discover what data are in the data warehouse
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds
the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
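A quick Python sketch that enumerates such a lattice for three illustrative dimensions; the full dimension set is the base cuboid and the empty group-by is the apex cuboid.

from itertools import combinations

dims = ("time", "item", "location")          # n-D base cuboid dimensions (illustrative)

cuboids = [combo for k in range(len(dims), -1, -1)
           for combo in combinations(dims, k)]

print(len(cuboids))      # 2**3 = 8 cuboids in the lattice
print(cuboids[0])        # ('time', 'item', 'location') -> base cuboid
print(cuboids[-1])       # ()                           -> apex cuboid (total summarization)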
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases
The entity-relationship data model is commonly used in the design of relational databases, where a
database schema consists of a set of entities and the relationships between them.
Such a data model is appropriate for on-line transaction processing.
A data warehouse, however, requires a concise, subject-oriented schema that facilitates on-line data
analysis.
The most popular data model for a data warehouse is a multidimensional model.
Such a model can exist in the form of a star schema, a snowflake schema, or a fact constellation
schema.
Stars Schema
The most common modeling paradigm is the star schema, in which the data warehouse contains
A large central table (Fact Table) containing the bulk of the data, with no redundancy
A set of smaller attendant tables(Dimension Tables), one for each dimension.
The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around
the central fact table.
Snowflake Schema
The snowflake schema is a variant of the star schema model, where some dimension tables are
normalized, thereby further splitting the data into additional tables.
The resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the dimension tables of the
snowflake model may be kept in normalized form to reduce redundancies.
Hence, although the snowflake schema reduces redundancy, it is not as popular as the star schema in
data warehouse design.
Fact Constellation
Sophisticated applications may require multiple fact tables to share dimension tables.
This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact
constellation.
Fig: Fact Constellation Schema of a data warehouse for sales
A data warehouse collects information about subjects that span the entire organization, such as customers,
items, sales, assets, and personnel, and thus its scope is enterprise-wide.
For data warehouse, the fact constellation schema is commonly used, since it can model multiple, interrelated
subjects.
A data mart, on the other hand, is a departmental subset of the data warehouse that focuses on selected subjects,
and thus its scope is department-wide. For data marts, the star or snowflake schema is commonly used, since
both are geared toward modeling single subjects, although the star schema is more popular and efficient.
Examples:
Star Schema:
define cube sales_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
OLAP Operations in the Multidimensional Data Model
A number of OLAP data cube operations exist to materialize these different views, allowing interactive
querying and analysis of the data at hand.
Example: (OLAP Operations)
Let’s look at some typical OLAP operations for multidimensional data. Each of the operations described below
is illustrated in the fig.
At the center of the figure is a data cube for AllElectronics sales. The cube contains the dimensions location,
time, and item, where location is aggregated with respect to city values, time is aggregated with respect to
quarters, and item is aggregated with respect to item types.
The data examined are for the cities Chicago, New York, Toronto, and Vancouver.
Drill-Down:
Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data.
Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing
additional dimensions.
Because a drill-down adds more detail to the given data, it can also be performed by adding new dimensions to a cube.
Pivot (rotate):
Pivot (also called rotate) is a visualization operation that rotates the data axes in view in order to provide an
alternative presentation of the data.
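As a rough illustration (not part of the original example), the following Python/pandas sketch mimics roll-up, slice, and pivot on a tiny, made-up sales table; all column names and figures are hypothetical.

import pandas as pd

sales = pd.DataFrame({
    "city":    ["Chicago", "Chicago", "Toronto", "Vancouver"],
    "quarter": ["Q1", "Q2", "Q1", "Q1"],
    "item":    ["home entertainment", "computer", "computer", "phone"],
    "dollars": [854, 943, 968, 38],
})

# Roll-up on location: climb from city to an "all cities" total per quarter/item
rollup = sales.groupby(["quarter", "item"])["dollars"].sum()

# Drill-down would step the other way, grouping on a finer-grained column
# (e.g., from quarter to month), which this tiny frame does not carry.

# Slice: fix one dimension (quarter = "Q1")
slice_q1 = sales[sales["quarter"] == "Q1"]

# Pivot (rotate): swap the axes of the presentation
pivot = sales.pivot_table(index="item", columns="city", values="dollars", aggfunc="sum")

print(rollup, slice_q1, pivot, sep="\n\n")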
Unit Three
Data Warehouse Architecture
In this section, we discuss issues regarding data warehouse architecture:
To design and construct a data warehouse
A three tier data warehouse architecture
Third, a data warehouse facilitates customer relationship management because it provides a consistent view
of customers and item across all lines of business, all departments and all markets.
Finally, a data warehouse may bring about cost reduction by tracking trends, patterns, and exceptions over
long periods in a consistent and reliable manner.
Four Different views regarding the design of a data warehouse must be considered:
The top-down view
The data source view
The data warehouse view
The business query view
The top-down view
It allows the selection of the relevant information necessary for the data warehouse. This information
matches the current and future business needs.
1. The bottom tier is a warehouse database server that is almost always a relational database system.
2. The middle tier is an OLAP server that is typically implemented using either (i) a relational OLAP
(ROLAP) model or (ii) a multidimensional OLAP (MOLAP) model.
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or
data mining tools.
Distributed Data Warehouse (DDW): Data shared across multiple data repositories, for the purpose of OLAP. Each data
warehouse may belong to one or many organizations. The sharing implies a common format or
definition of data elements (e.g. using XML).
Distributed data warehousing encompasses a complete enterprise DW but has smaller data stores that
are built separately and joined physically over a network, providing users with access to relevant
reports without impacting on performance.
A distributed DW, the nucleus of all enterprise data, sends relevant data to individual data marts from
which users can access information for order management, customer billing, sales analysis, and other
reporting and analytic functions.
Virtual Warehouse
The data warehouse is a great idea, but it is complex to build and requires investment. Why not use a
cheap and fast approach by eliminating the transformation steps and the extra repositories for metadata
and data?
This approach is termed the 'virtual data warehouse'. To accomplish this, there is a need to define four kinds
of information:
A data dictionary containing the definitions of the various databases.
A description of the relationship among the data elements.
The description of the way the user will interface with the system.
The algorithms and business rules that define what to do and how to do it.
Disadvantages of VDW
Since queries compete with production data transactions, performance can be degraded.
There is no metadata, no summary data, and no individual DSS (Decision Support
System) integration or history. All queries must be repeated, causing an additional
burden on the system.
There is no refreshing process, causing the queries to be very complex.
Unit Four
Example:
A data cube is a lattice of cuboids.
Suppose that you would like to create a data cube for ALLElectronics sales that contains the following: city,
item, year, and sales_in_dollars.
You would like to be able to analyze the data, with queries such as following:
“Compute the sum of sales, grouping by city and item.”
“Compute the sum of sales, grouping by city.”
“Compute the sum of sales, grouping by item.”
Fig: Lattice of cuboids, making up a 3-D data cube. Each cuboid represents a different group-by. The base
cuboid contains the three dimensions city, item, and year.
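Each of these group-by queries corresponds to one cuboid of the lattice. A minimal Python/pandas sketch, using a made-up sales table, that computes them plus the apex total:

import pandas as pd

sales = pd.DataFrame({
    "city":  ["Vancouver", "Vancouver", "Toronto", "Toronto"],
    "item":  ["TV", "computer", "TV", "computer"],
    "year":  [2011, 2011, 2012, 2012],
    "sales_in_dollars": [300, 1500, 250, 1200],
})

by_city_item = sales.groupby(["city", "item"])["sales_in_dollars"].sum()   # 2-D cuboid
by_city      = sales.groupby("city")["sales_in_dollars"].sum()             # 1-D cuboid
by_item      = sales.groupby("item")["sales_in_dollars"].sum()             # 1-D cuboid
total        = sales["sales_in_dollars"].sum()                             # apex cuboid
print(by_city_item, by_city, by_item, total, sep="\n\n")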
Curse of Dimensionality
OLAP may need to access different cuboids for different queries.
Therefore, it may seem like a good idea to compute all or at least some of the cuboids in a data cube in
advance.
Pre-computation leads to fast response time and avoids some redundant computation.
A major challenge related to this pre-computation, however, is that the required storage space may explode if all
the cuboids in a data cube are pre-computed, especially when the cube has many dimensions.
The storage requirements are even more excessive when many of the dimensions have associated
concept hierarchies, each with multiple levels.
This problem is referred to as the curse of dimensionality.
If there are many cuboids, and these cuboids are large in size, a more reasonable option is partial
materialization, that is, to materialize only some of the possible cuboids that can be generated.
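For an n-dimensional data cube in which dimension i has Li concept-hierarchy levels (excluding the virtual top level "all"), the total number of cuboids that full materialization would have to compute is
Total cuboids = (L1 + 1) × (L2 + 1) × … × (Ln + 1)
For example, a cube with 10 dimensions, each with 4 hierarchy levels, already has 5^10 ≈ 9.8 million cuboids.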
Partial Materialization
There are three choices for data cube materialization given a base cuboid:
No Materialization
Do not pre-compute any of the “nonbase” cuboids. This leads to computing expensive multidimensional
aggregates on the fly, which can be extremely slow.
Full Materialization
Pre-compute all of the cuboids. The resulting lattice of computed cuboids is referred to as the full cube. This
choice typically requires huge amounts of memory space in order to store all of the pre-computed cuboids.
Partial Materialization
Selectively compute a proper subset of the whole set of possible cuboids. It represents an interesting trade-off
between storage space and response time.
Identify the subset of cuboids or sub-cubes to materialize
Exploit the materialized cuboids or sub-cubes during query processing
Efficiently update the materialized cuboid or sub-cubes during load and refresh.
In the bitmap index for a given attribute, there is a distinct bit vector, Bv, for each value v in the
domain of the attribute.
If the domain of a given attribute consists of n values, then n bits are needed for each entry in the
bitmap index (i.e. there are n bit vectors).
If the attribute has the value v for a given row in the data table, then the bit representing that value is
set to 1 in the corresponding row of the bitmap index.
All other bits for that row are set to 0.
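A minimal Python sketch of this encoding for a single attribute (the rows and attribute values are made up):

# Build a bitmap index for the attribute "item" of a small, hypothetical fact table.
rows = ["home entertainment", "computer", "phone", "computer", "security"]

domain = sorted(set(rows))                       # n distinct values -> n bit vectors
bitmap = {v: [1 if r == v else 0 for r in rows]  # one bit per row, 1 where the value matches
          for v in domain}

for value, bits in bitmap.items():
    print(value, bits)
# e.g. "computer" -> [0, 1, 0, 1, 0]: rows 1 and 3 contain the value "computer"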
The join indexing method gained popularity from its use in relational database query processing.
Join indexing registers the joinable rows of two relations from a relational database.
Join indexing is especially useful for maintaining the relationship between a foreign key and its
matching primary keys, from the joinable relation.
The star schema model of data warehouses makes join indexing attractive for crosstable search,
because the linkage between a fact table and its corresponding dimension tables comprises the foreign
key of the fact table and the primary key of the dimension table.
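A join index can be pictured as a precomputed map from a dimension key to the identifiers of the fact-table rows that join with it. The tiny Python sketch below uses hypothetical sale records:

# Hypothetical fact rows (sales) carrying a foreign key to the location dimension.
sales = [
    {"rid": "T57",  "loc_key": "L1", "dollars": 300},
    {"rid": "T238", "loc_key": "L2", "dollars": 120},
    {"rid": "T884", "loc_key": "L1", "dollars": 980},
]

# Join index: location key -> list of joinable fact-table row ids.
join_index = {}
for row in sales:
    join_index.setdefault(row["loc_key"], []).append(row["rid"])

print(join_index)   # {'L1': ['T57', 'T884'], 'L2': ['T238']}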
Estimating the costs of using the remaining materialized cuboids, and selecting the cuboid
with the least cost.
Because the storage model of a MOLAP server is an n-dimensional array, front-end multidimensional
queries are mapped directly to server storage structures, which provide direct addressing capabilities.
The storage structure used by dense and sparse arrays may differ, making it advantageous to adopt a
two-level approach to MOLAP query processing.
The two-dimensional dense arrays can be indexed by B-trees.
Tuning and Testing of Data Warehouse
ETL or Data warehouse testing is categorized into four different engagements:
New Data Warehouse Testing – A new DW is built and verified from scratch. Data input is taken from
customer requirements and different data sources, and the new data warehouse is built and verified with the
help of ETL tools.
Migration Testing – In this type of project, the customer has an existing DW and ETL performing
the job, but is looking to adopt a new tool in order to improve efficiency.
Change Request – In this type of project, new data is added from different sources to an existing DW. Also,
there might be a condition where the customer needs to change an existing business rule or integrate a new rule.
Report Testing – Reports are the end result of any data warehouse and the basic purpose for which the DW is
built. Reports must be tested by validating layout, data in the report, and calculations.
Tuning
There is little that can be done to tune any business rules enforced by constraints. If the rules are
enforced using SQL or trigger code, that code needs to be tuned for maximal efficiency.
The load can also be improved by using parallelism.
The data warehouse will contain two types of query: fixed queries that are clearly defined and well understood,
such as regular reports, common aggregations, etc., and ad hoc queries whose form is not known in advance.
Often the correct tuning choice for such eventualities will be to allow an infrequently used index or
aggregation to exist to catch just those sorts of query.
To create those sorts of indexes or aggregations, you must have an understanding that such queries are
likely to be run.
Before you can tune the data warehouse you must have some objective measures of performance to
work with. Measures such as
Average query response times
Scan rates
I/O throughput rates
Time used per query
Memory usage per process
These measures should be specified in the service level agreement (SLA).
Unit Five
What is KDD?
Computational theories and tools to assist humans in extracting useful information (i.e. knowledge)
from digital data
Development of methods and techniques for making sense of data
Maps low-level data into other forms that are:
more compact (e.g., short reports)
more abstract (e.g., a model of the process generating the data)
more useful (e.g., a predictive model for future cases)
The core of the KDD process employs “data mining”
Why KDD?
The sizes of datasets are growing extremely large – billions of records and hundreds of thousands of fields
Analysis of data must be automated
Computers enable us to generate amounts of data too large for humans to digest, so we should also use computers
to discover meaningful patterns and structures from the data
Data mining:
An application of specific algorithms for extracting patterns from data.
Data mining is a step in the KDD process.
Deviation Detection
Classification and Regression
Clustering
Association Rule Discovery:
Detects sets of attributes that frequently co-occur, and rules among them, e.g., 90% of the people who buy
cookies also buy milk (60% of all grocery shoppers buy both)
Sequence Mining (Categorical):
Discover sequences of events that commonly occur together.
Business
advertising,
Customer modeling and CRM (Customer Relationship management)
e-Commerce,
fraud detection
health care, ...
investments,
manufacturing,
sports/entertainment,
telecom (telephone and communications),
targeted marketing,
Traditional Data Mining Tools
Traditional data mining programs help companies establish data patterns and trends by using a number
of complex algorithms and techniques.
Some of these tools are installed on the desktop to monitor the data and highlight trends and others
capture information residing outside a database.
The majority are available in both Windows and UNIX versions, although some specialize in one
operating system only.
In addition, while some may concentrate on one database type, most will be able to handle any data
using online analytical processing or a similar technology.
Dashboards
Installed in computers to monitor information in a database, dashboards reflect data changes and
updates onscreen — often in the form of a chart or table — enabling the user to see how the business is
performing.
Historical data also can be referenced, enabling the user to see where things have changed (e.g.,
increase in sales from the same period last year).
This functionality makes dashboards easy to use and particularly appealing to managers who wish to
have an overview of the company's performance.
Text-mining Tools
The third type of data mining tool sometimes is called a text-mining tool because of its ability to mine
data from different kinds of text — from Microsoft Word and Acrobat PDF documents to simple text
files, for example.
These tools scan content and convert the selected data into a format that is compatible with the tool's
database, thus providing users with an easy and convenient way of accessing data without the need to
open different applications.
Scanned content can be unstructured (i.e., information is scattered almost randomly across the
document, including e-mails, Internet pages, audio and video data) or structured (i.e., the data's form
and purpose is known, such as content found in a database).
Unit Six
Data Mining Query Language
DMQL
A Data Mining Query language for Relational Databases
Create and manipulate data mining models through a SQL-based interface
Approaches differ on what kinds of models should be created, and what operations we should be able
to perform
Background knowledge
• Concept hierarchies based on attribute relationship, etc.
Various thresholds
• Minimum support, confidence, etc.
DMQL
Relevant attributes or aggregations
Based on these primitives, we design a query language for data mining called DMQL (Data Mining
Query Language).
DMQL allows the ad hoc mining of several kinds of knowledge from relational databases and data
warehouses at multiple levels of abstraction.
The language adopts an SQL-like syntax, so that it can easily be integrated with the relational query
language SQL.
Data Specifications
The first step in defining a data mining task is the specification of the task-relevant data, that is, the
data on which mining is to be performed.
This involves specifying the database and tables or data warehouse containing the relevant data, the
conditions for selecting the relevant data, the relevant attributes or dimensions for exploration, and
instructions regarding the ordering or grouping of the data retrieved.
Hierarchy Specification
Concept hierarchies allow the mining of knowledge at multiple levels of abstraction.
In order to accommodate the different viewpoints of users with regard to the data, there may be more
than one concept hierarchy per attribute or dimension.
For instance, some users may prefer to organize branch locations by provinces and states, while others
may prefer to organize them according to the languages used.
In such cases, a user can indicate which concept hierarchy is to be used with the statement
use hierarchy <hierarchy_name> for <attribute_or_dimension>. Otherwise, a default hierarchy per attribute or
dimension is used.
Interestingness measures and thresholds are specified with a clause of the form:
with <interest_measure_name> threshold = <threshold_value>
Interactive mining should allow the discovered patterns to be viewed at different concept level or from
different angles.
This can be accomplished with roll-up and drill-down operations.
Note:
We presented DMQL syntax for specifying data mining queries in terms of the five data mining
primitives.
For a given query, these primitives define the task-relevant data, the kind of knowledge to be mined,
the concept hierarchies and interestingness measures to be used, and the presentation forms for pattern
visualization.
Unit Seven
Background
Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear in a data
set frequently.
For Example:
A set of items, such as milk and bread, that appear frequently together in a transaction data set is a
frequent itemset.
A subsequence, such as first buying a PC, then a digital camera, and then a memory card, if it occurs
frequently in a shopping history database, is a (frequent) sequential pattern.
Thus, frequent pattern mining has become an important data mining task and a focused theme
in data mining research.
Example:
People buying school uniforms in June also buy school bags
(People buying school bags in June also buy school uniforms)
Let minimum support = 0.3
Mining association rules using apriori
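A compact, illustrative Python sketch of the Apriori frequent-itemset phase: the transactions are made up, the minimum support of 0.3 matches the value used above, and candidate generation uses a simple join without the extra prune step, to keep the sketch short.

from itertools import combinations

transactions = [                       # hypothetical market-basket data
    {"uniform", "bag", "shoes"},
    {"uniform", "bag"},
    {"bag", "pencil"},
    {"uniform", "shoes"},
    {"uniform", "bag", "pencil"},
]
min_support = 0.3                      # an itemset must appear in >= 30% of the transactions

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# L1: frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

k = 1
while frequent[-1]:
    # candidate generation: join frequent k-itemsets, then keep those meeting min_support
    prev = frequent[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level, sets in enumerate(frequent, start=1):
    for s in sets:
        print(level, sorted(s), round(support(s), 2))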
Unit Eight
Contents
Classification maps each data element to one of a set of predetermined classes based on the differences among
data elements belonging to different classes.
Clustering groups data elements into different groups based on the similarity between elements within a single
group.
Definition
Classification and prediction are two forms of data analysis that can be used to extract models
describing important data classes or to predict future data trends.
Such analysis can help provide us with a better understanding of the data at large.
Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued
functions.
Many classification and prediction methods have been proposed by researchers in machine learning,
pattern recognition, and statistics.
Robustness
This refers to the ability of the classifier or predictor to make correct predictions given noisy data or data with
missing values.
Scalability
This refers to the ability to construct the classifier or predictor efficiently given large amounts of data
Interpretability
This refers to the level of understanding and insight that is provided by the classifier or predictor.
Classification Techniques
Classification Problem
Classification Techniques
Example (classification on weather data):
By outlook alone: Sunny → Yes; Cloudy → Yes/No; Overcast → Yes/No.
With temperature added:
Outlook, Temperature → Play?
Cloudy, Warm → Yes
Cloudy, Chilly → No
Cloudy, Pleasant → Yes
Overcast, Warm → ?
Overcast, Chilly → No
Overcast, Pleasant → Yes
Clustering Techniques
Clustering partitions the data set into clusters or equivalence classes.
Similarity among members of a class is greater than similarity among members across classes.
Similarity measures: Euclidean distance or other application-specific measures.
Clustering Techniques
Nearest Neighbour Clustering Algorithm:
Given n elements x1, x2, …, xn, and a threshold t:
1. j ← 1, k ← 1, cluster = { }
2. Repeat
I. Find the nearest neighbour of xj
II. Let the nearest neighbour be in cluster m
III. If the distance to the nearest neighbour > t, then create a new cluster and k ← k + 1; else assign
xj to cluster m
IV. j ← j + 1
3. until j > n
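A minimal Python rendering of this algorithm, assuming one-dimensional numeric data and absolute (Euclidean) distance; both choices are purely illustrative.

def nearest_neighbour_clustering(xs, t):
    # clusters[i] is the cluster id assigned to xs[i]
    clusters = [0]                                  # x1 starts cluster 0
    for j in range(1, len(xs)):
        # find the nearest already-clustered element for xs[j]
        nearest = min(range(j), key=lambda i: abs(xs[j] - xs[i]))
        if abs(xs[j] - xs[nearest]) > t:
            clusters.append(max(clusters) + 1)      # open a new cluster (k <- k + 1)
        else:
            clusters.append(clusters[nearest])      # join the nearest neighbour's cluster
    return clusters

print(nearest_neighbour_clustering([1.0, 1.2, 5.0, 5.3, 9.9], t=1.0))
# [0, 0, 1, 1, 2]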
Regression
Numeric prediction is the task of predicting continuous (or ordered) values for a given input
For example:
We may wish to predict the salary of college graduates with 10 years of work experience, or the potential sales
of a new product given its price.
The mostly used approach for numeric prediction is regression
A statistical methodology that was developed by Sir Francis Galton (1822–1911), a mathematician who was
also a cousin of Charles Darwin
Many texts use the terms “regression” and “numeric prediction” synonymously
Regression analysis can be used to model the relationship between one or more independent or predictor
variables and a dependent or response variable (which is continuous value)
In the context of data mining, the predictor variables are the attributes of interest describing the tuple
The response variable is what we want to predict
Types of Regression
The types of Regression are as:
Linear Regression
NonLinear Regression
Linear Regression
Straight-line regression analysis involves a response variable, y, and a single predictor variable, x.
It is the simplest form of regression, and models y as a linear function of x.
That is,
y=b+wx
Where the variance of y is assumed to be constant, and b and w are regression coefficients specifying the Y-
intercept and slope of the line, respectively.
The regression coefficients, w and b, can also be thought of as weights, so that we can equivalently write
y=w0+w1x.
The regression coefficients can be estimated by the method of least squares, using the following equations:
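w1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²   and   w0 = ȳ − w1·x̄
where x̄ and ȳ are the means of the x and y values, respectively. A tiny Python sketch with made-up (hypothetical) data points:

xs = [1.0, 2.0, 3.0]   # e.g., years of experience (illustrative values)
ys = [2.0, 3.0, 5.0]   # e.g., salary in some unit (illustrative values)

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
w0 = y_bar - w1 * x_bar
print(w0, w1)          # fitted line y = w0 + w1*x, here approximately 0.33 + 1.5x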
Multiple regression problems are instead commonly solved with the use of statistical software packages, such as
SPSS(Statistical Package for the Social Sciences), etc..
NonLinear Regression
In straight-line linear regression, the dependent response variable, y, is modeled as a linear function of
a single independent predictor variable, x.
What if we could get a more accurate model using a nonlinear model, such as a parabola or some other higher-order
polynomial?
Polynomial regression is often of interest when there is just one predictor variable.
Consider a cubic polynomial relationship given by
y = w0 + w1x + w2x² + w3x³
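Note that such a polynomial model can be transformed into the linear case: by defining new variables x1 = x, x2 = x², x3 = x³, the cubic model becomes y = w0 + w1x1 + w2x2 + w3x3, which is linear in the coefficients and can be solved by the same method of least squares.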
NonLinear Regression
In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a
function which is a nonlinear combination of the model parameters and depends on one or more independent
variables. The data are fitted by a method of successive approximations.
Contents
Clustering
Definition
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.
A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar
to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be
considered as a form of data compression.
First the set is partitioned into groups based on data similarity (e.g., using clustering), and then labels are
assigned to the relatively small number of groups.
It is also called unsupervised learning. Unlike classification, clustering and unsupervised learning do not rely
on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by
observation, rather than learning by examples.
Definition
Clustering is also called data segmentation in some applications because clustering partitions large data sets
into groups according to their similarity.
Clustering can also be used for outlier detection, where outliers (values that are “far away” from any cluster)
may be more interesting than common cases.
Advantages
Advantages of such a clustering-based process:
Adaptable to changes
Helps single out useful features that distinguish different groups.
Applications of Clustering
Market research
Pattern recognition
Data analysis
Image processing
Biology
Geography
Automobile insurance
Outlier detection
K-Mean Algorithm
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the
objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for each cluster;
(5) until no change;
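A compact Python sketch of the same procedure for one-dimensional data; the data, the value of k, and the use of absolute distance are illustrative choices, not part of the algorithm statement above.

import random

def k_means(data, k, seed=0):
    random.seed(seed)
    centers = random.sample(data, k)                 # (1) arbitrarily choose k objects
    while True:
        # (3) (re)assign each object to the cluster with the most similar (nearest) mean
        clusters = [[] for _ in range(k)]
        for x in data:
            idx = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[idx].append(x)
        # (4) update the cluster means (keep the old center if a cluster is empty)
        new_centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:                   # (5) until no change
            return clusters
        centers = new_centers

print(k_means([1.0, 1.5, 2.0, 8.0, 8.5, 9.0], k=2))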
K-Medoids Algorithm
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative object;
(4) randomly select a nonrepresentative object, orandom;
(5) compute the total cost, S, of swapping representative object, oj, with orandom;
(6) if S < 0 then swap oj with orandom to form the new set of k representative objects;
(7) until no change;
Bayesian Classification
Bayesian classification is based on Bayes’ theorem.
Studies comparing classification algorithms have found a simple Bayesian classifier known as the naïve
Bayesian classifier to be comparable in performance with decision tree and selected neural network classifiers.
Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.
Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the
values of the other attributes. This assumption is called class conditional independence.
Bayesian belief networks are graphical models, which unlike naïve Bayesian classifiers, allow the representation
of dependencies among subsets of attributes.
Bayes’ Theorem
Let X be a data tuple and H a hypothesis, such as that X belongs to a specified class C. Bayes’ theorem gives the
posterior probability as P(H|X) = P(X|H) P(H) / P(X).
Unit Nine
Contents
Spatial Data Mining
Multimedia Data Mining
Text Data Mining
Web Data Mining
Definition
Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not
explicitly stored in spatial databases.
It is expected to have wide applications in geographic information systems, geo-marketing, remote sensing,
image database exploration, medical imaging, navigation, traffic control, environmental studies, and many other
areas where spatial data are used.
Statistical spatial data analysis has been a popular approach to analyzing spatial data and exploring geographic
information.
A spatial-to-spatial dimension is a dimension whose primitive level and all of its high-level generalized data
are spatial.
For example:
The dimension equi_temperature_region contains spatial data, as do all of its generalizations, such as
regions covering 0–5 degrees (Celsius), 5–10 degrees, and so on.
It is important to explore data mining in raster or image databases. Methods for mining raster and image data are
examined in the following section regarding the mining of multimedia data.
Description-based retrieval is labor-intensive if performed manually. If automated, the results are typically of
poor quality.
Recent development of Web-based image clustering and classification methods has improved the quality of
description-based Web image retrieval, because the text surrounding an image as well as Web linkage
information can be used to extract a proper description and to group images describing a similar theme together.
Content-based retrieval uses visual features to index images and promotes object retrieval based on feature
similarity, which is highly desirable in many applications
In a content-based image retrieval system, there are often two kinds of queries:
i. image-sample-based queries
ii. Image feature specification queries
Content-based retrieval has wide applications, including medical diagnosis, weather prediction, TV
production, Web Search engines for images, and e-commerce.
Image-sample-based queries find all of the images that are similar to the given image sample.
This search compares the feature vector (or signature) extracted from the sample with the feature vectors of
images that have already been extracted and indexed in the image database.
Based on this comparison, images that are close to the sample image are returned.
Image feature specification queries specify or sketch image features like color, texture, or shape, which are
translated into a feature vector to be matched with the feature vectors of the images in the database.
Some systems, such as QBIC (Query By Image Content), support both sample-based and image feature
specification queries.
There are also systems that support both content-based and description-based retrieval.
Several approaches have been proposed and studied for similarity-based retrieval in image databases, based
on image signature:
I. Color histogram-based signature
II. Multifeature composed signature
III. Wavelet-based signature
IV. Wavelet-based signature with region-based granularity
Audio and Video Data Mining
Besides still images, an incommensurable amount of audiovisual information is becoming available in digital
form, in digital archives, on the World Wide Web, in broadcast data streams, and in personal and professional
databases.
There are great demands for effective content-based retrieval and data mining methods for audio and video
data.
Typical examples include searching for and multimedia editing of particular video clips in a TV studio,
detecting suspicious persons or scenes in surveillance videos, and finding a particular melody or tune in your
MP3 audio album.
To facilitate the recording, search, and analysis of audio and video information from multimedia data, industry
and standardization committees have made great strides towards developing a set of standards for multimedia
information description and compression.
For example,
MPEG-k (developed by MPEG: Moving Picture Experts Group) and JPEG are typical video compression
schemes.
The most recently released MPEG-7, formally named “Multimedia Content Description Interface” is a standard
for describing the multimedia content data.
Text Mining
Most previous studies of data mining have focused on structured data, such as relational, transactional, and data
warehouse data.
However, in reality, a substantial portion of the available information is stored in text databases (or document
databases), such as news articles, research papers, books, digital libraries, e-mail messages, and Web pages.
Text databases are rapidly growing due to the increasing amount of information available in electronic form,
such as electronic publications, various kinds of electronic documents, e-mail, and WWW.
Data stored in most text databases are semi-structured data in that they are neither completely
unstructured nor completely structured.
Text Mining
An information retrieval system often needs to trade off recall for precision or vice versa. One commonly used
trade-off is the F-score, defined as the harmonic mean of recall and precision (see the notation in the book, p. 616):
F-score = (2 × precision × recall) / (precision + recall)
Precision, recall, and F-score are the basic measures of a retrieved set of documents.
Because of the difficulty in prescribing a user’s information need exactly with a Boolean query, the Boolean
retrieval method generally only works well when the user knows a lot about the document collection and can
formulate a good query in this way.
Document Ranking methods use the query to rank all documents in the order of relevance.
For ordinary users and exploratory queries, these methods are more appropriate than document selection
methods.
Most modern information retrieval systems present a ranked list of documents in response to a user’s keyword
query.
There are many different ranking methods based on a large spectrum of mathematical foundations, including
algebra, logic, probability, and statistics.
A signature file is a file that stores a signature record for each document in the database.
Each signature has a fixed size of b bits representing terms.
A simple encoding scheme goes as follows: each bit of a document signature is initialized to 0.
A bit is set to 1 if the term it represents appears in the document.
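A minimal Python sketch of this encoding, assuming a hypothetical signature width of b = 8 bits and an illustrative fixed mapping of terms to bit positions:

b = 8                                           # fixed signature size in bits (assumed)
term_bit = {"data": 0, "mining": 1, "warehouse": 2, "text": 3, "retrieval": 4}   # illustrative mapping

def signature(document_terms):
    bits = [0] * b                              # each bit initialized to 0
    for term in document_terms:
        bits[term_bit[term]] = 1                # set the bit representing the term to 1
    return bits

doc1 = ["data", "mining", "warehouse"]
doc2 = ["text", "retrieval"]
print(signature(doc1), signature(doc2))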
Based on the following observations, the Web also poses great challenges for effective resource and knowledge
discovery:
The Web seems to be too huge for effective data warehousing and data mining.
The complexity of Web pages is far greater than that of any traditional text document
collection
The Web is a highly dynamic information source
The Web serves a broad diversity of user communities
Only a small portion of the information on the Web is truly relevant or useful
Taxonomy of Web Mining
Web content mining examines the content of Web pages as well as results of Web searching. The content
includes text as well as graphics data. Web content mining is further divided into Web page content mining and
search results mining.
Web page content mining is traditional searching of Web pages via content, while Search results mining is a
further search of pages found from a previous search.
With Web structure mining, information is obtained from the actual organization of pages on the Web.
Web usage mining looks at the logs of Web access. General access pattern tracking is a type of usage mining
that looks at a history of Web pages visited. This usage may be general or may be targeted to specific usage or
users. Usage mining also involves mining of these sequential patterns.
However, Web mining can be used to substantially enhance the power of a Web search engine, since Web mining
may identify authoritative Web pages, classify Web documents, and resolve many ambiguities and subtleties
raised in keyword-based Web search.
In general, Web mining tasks can be classified into three categories:
Web Content Mining
Web Structure Mining
Web Usage Mining