introduction to DataWarehouse and DataMining
introduction to DataWarehouse and DataMining
Data Warehouse:
Data warehouse is like a relational database designed for analytical needs. It functions on the
basis of OLAP (Online Analytical Processing). It is a central location where consolidated data
from multiple locations (databases) are stored.
Now, if we want to view the sales data with a third dimension, For example, suppose the data
according to time and item, as well as the location is considered for the cities Chennai, Kolkata,
Mumbai, and Delhi. These 3D data are shown in the table. The 3D data of the table are
represented as a series of 2D tables.
Conceptually, it may also be represented by the same data in the form of a 3D data cube, as
shown in fig:
What is Schema?
Schema is a logical description of the entire database.
It includes the name and description of records of all record types including all
associated data-Items and aggregates.
Much like a database, a data warehouse also requires to maintain a schema.
A database uses relational model, while a data warehouse uses Star, Snowflake, and
Fact Constellation schema.
Modeling data warehouses: dimensions & measures
Star schema: A fact table in the middle connected to a set of dimension tables
Snowflake schema: A refinement of star schema where some dimensional hierarchy is
normalized into a set of smaller dimension tables, forming a shape similar to snowflake
Fact constellations: Multiple fact tables share dimension tables, viewed as a collection
of stars, therefore called galaxy schema or fact constellation
Star Schema:
A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions.
A fact is an event that is counted or measured, such as a sale or log in. A dimension
includes reference data about the fact, such as date, item, or customer.
A star schema is a relational schema where a relational schema whose design represents
a multidimensional data model.
The star schema is the explicit data warehouse schema. It is known as star
schema because the entity-relationship diagram of this schemas simulates a star, with
points, diverge from a central table.
The center of the schema consists of a large fact table, and the points of the star are the
dimension tables.
There is a fact table at the center. It contains the keys to each of four dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.
Each dimension has only one dimension table and each table holds a set of attributes.
For example, the location dimension table contains the attribute set {location_key,
street, city, province_or_state,country}. This constraint may cause data redundancy. For
example, "Vancouver" and "Victoria" both the cities are in the Canadian province of
British Columbia. The entries for such cities may cause data redundancy along the
attributes province_or_state and country.
Snowflake Schema:
Some dimension tables in the Snowflake schema are normalized.
The normalization splits up the data into additional tables.
Unlike Star schema, the dimensions table in a snowflake schema are normalized.
For example, the item dimension table in star schema is normalized and split into two
dimension tables, namely item and supplier table.
Now the item dimension table contains the attributes item_key, item_name, type,
brand, and supplier-key.
The supplier key is linked to the supplier dimension table. The supplier dimension table
contains the attributes supplier_key and supplier_type.
Note : Due to normalization in the Snowflake schema, the redundancy is reduced and
therefore, it becomes easy to maintain and the save storage space.
Advantages:
(i) Less redundancies due to normalization Dimension Tables.
(ii) Dimension Tables are easier to update.
Disadvantages:
It is complex schema when compared to star schema.
This schema defines two fact tables, sales, and shipping. Sales are treated along four
dimensions, namely, time, item, branch, and location.
The schema contains a fact table for sales that includes keys to each of the four
dimensions, along with two measures: Rupee_sold and units_sold.
The shipping table has five dimensions, or keys: item_key, time_key, shipper_key,
from_location, and to_location, and two measures: Rupee_cost and units_shipped.
It is also possible to share dimension tables between fact tables. For example, time,
item, and location dimension tables are shared between the sales and shipping fact
table.
Disadvantages:
(i) Complex due to multiple fact tables
(ii) It is difficult to manage
(iii) Dimension Tables are very large.
OLAP OPERATIONS:
In the multidimensional model, the records are organized into various dimensions, and
each dimension includes multiple levels of abstraction described by concept hierarchies.
This organization support users with the flexibility to view data from various
perspectives.
A number of OLAP data cube operation exist to demonstrate these different views,
allowing interactive queries and search of the record at hand. Hence, OLAP supports a
user-friendly environment for interactive data analysis.
Consider the OLAP operations which are to be performed on multidimensional data.
The data cubes for sales of a shop. The cube contains the dimensions, location, and time
and item, where the location is aggregated with regard to city values, time is
aggregated with respect to quarters, and an item is aggregated with respect to item
types.
OLAP having 5 different operations
(i) Roll-up
(ii) Drill-down
(iii) Slice
(iv) Dice
(v) Pivot
Roll-up:
The roll-up operation performs aggregation on a data cube, by climbing down concept
hierarchies, i.e., dimension reduction. Roll-up is like zooming-out on the data cubes.
It is also known as drill-up or aggregation operation
Figure shows the result of roll-up operations performed on the dimension location. The
hierarchy for the location is defined as the Order Street, city, province, or state, country.
The roll-up operation aggregates the data by ascending the location hierarchy from the
level of the city to the level of the country.
When a roll-up is performed by dimensions reduction, one or more dimensions are
removed from the cube.
For example, consider a sales data cube having two dimensions, location and time. Roll-
up may be performed by removing, the time dimensions, appearing in an aggregation of
the total sales by location, relatively than by location and by time.
Fig: Roll-up operation on Data Cube
Drill-Down
The drill-down operation is the reverse operation of roll-up.
It is also called roll-down operation.
Drill-down is like zooming-in on the data cube.
It navigates from less detailed record to more detailed data. Drill-down can be
performed by either stepping down a concept hierarchy for a dimension or adding
additional dimensions.
Figure shows a drill-down operation performed on the dimension time by stepping
down a concept hierarchy which is defined as day, month, quarter, and year.
Drill-down appears by descending the time hierarchy from the level of the quarter to a
more detailed level of the month.
Because a drill-down adds more details to the given data, it can also be performed by
adding a new dimension to a cube.
Fig: Drill-down operation
Slice:
A slice is a subset of the cubes corresponding to a single value for one or more members
of the dimension.
The slice operation provides a new sub cube from one particular dimension in a given
cube.
For example, a slice operation is executed when the customer wants a selection on one
dimension of a three-dimensional cube resulting in a two-dimensional site. So, the Slice
operations perform a selection on one dimension of the given cube, thus resulting in a
sub cube.
Here Slice is functioning for the dimensions "time" using the criterion time = "Q1".
It will form a new sub-cubes by selecting one or more dimensions.
Fig: Slice operation
Dice:
The dice operation describes a sub cube by operating a selection on two or more
dimension.
Fig: Dice operation
The dice operation on the cubes based on the following selection criteria involves three
dimensions.
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item =" Mobile" or "Modem")
Pivot:
The pivot operation is also called a rotation.
Pivot is a visualization operations which rotates the data axes in view to provide an
alternative presentation of the data.
It may contain swapping the rows and columns or moving one of the row-dimensions
into the column dimensions.
(Data base Management System)Software that controls the organization, storage, retrieval, security and
integrity of data in a database.
The major DBMS vendors are Oracle, IBM, Microsoft and Sybase (see Oracle Database, DB2, SQL Server
and ASE).
3. Data Mart
It is a subset of a DWH that supports a particular department, region, or business unit. Consider
this: You have multiple departments, including sales, marketing, product development, etc.
Each department will have a central repository where it stores data. This repository is called a
data mart. The EDW stores the data from the data mart in the ODS on a daily/weekly (or as
configured) basis. The ODS acts as a staging area for data integration. It then sends the data to
the EDW to store it and use it for BI purposes.
DATAWAREHOUSE COMPONENTS:
The data warehouse is based on an RDBMS server which is a central information repository that
is surrounded by some key components to make the entire environment functional,
manageable and accessible
There are mainly five components of Data Warehouse:
METADATA:
The name Meta Data suggests some high- level technological concept. However, it is quite
simple. Metadata is data about data which defines the data warehouse. It is used for building,
maintaining and managing the data warehouse.
In the Data Warehouse Architecture, meta-data plays an important role as it specifies the
source, usage, values, and features of data warehouse data. It also defines how data can be
changed and processed. It is closely connected to the data warehouse.
QUERY TOOLS:
One of the primary objects of data warehousing is to provide information to businesses to
make strategic decisions. Query tools allow users to interact with the data warehouse system.
These tools fall into four different categories:
Query and reporting tools
Application Development tools
Data mining tools
OLAP tools
Characteristics of OLAP:
The main characteristics of OLAP are as follows:
Multidimensional conceptual view: OLAP systems let business users have a dimensional and
logical view of the data in the data warehouse. It helps in carrying slice and dice operations.
Multi-User Support: Since the OLAP techniques are shared, the OLAP operation should provide
normal database operations, containing retrieval, update, adequacy control, integrity, and
security.
Accessibility: OLAP acts as a mediator between data warehouses and front-end. The OLAP
operations should be sitting between data sources (e.g., data warehouses) and an OLAP front-
end.
Storing OLAP results: OLAP results are kept separate from data sources.
Uniform documenting performance: Increasing the number of dimensions or database size
should not significantly degrade the reporting performance of the OLAP system.
OLAP provides for distinguishing between zero values and missing values so that aggregates are
computed correctly.
OLAP system should ignore all missing values and compute correct aggregate values.
OLAP facilitate interactive query and complex analysis for the users.
OLAP allows users to drill down for greater details or roll up for aggregations of metrics along a
single business dimension or across multiple dimension.
OLAP provides the ability to perform intricate calculations and comparisons.
OLAP presents results in a number of meaningful ways, including charts and graphs.
OLAP Types:
Three types of OLAP servers are:-
1. Relational OLAP (ROLAP)
2. Multidimensional OLAP (MOLAP)
3. Hybrid OLAP (HOLAP)
Advantages of ROLAP:
1. ROLAP can handle large amounts of data.
2. Can be used with data warehouse and OLTP systems.
Disadvantages of ROLAP:
1. Limited by SQL functionalities.
2. Hard to maintain aggregate tables.
Advantages of MOLAP
1. Optimal for slice and dice operations.
2. Performs better than ROLAP when data is dense.
3. Can perform complex calculations.
Disadvantages of MOLAP
1. Difficult to change dimension without re-aggregation.
2. MOLAP can handle limited amount of data.
Advantages of HOLAP
1. HOLAP provide advantages of both MOLAP and ROLAP.
2. Provide fast access at all levels of aggregation.
Disadvantages of HOLAP
1. HOLAP architecture is very complex because it support both MOLAP and ROLAP servers.
CH.YADAVENDRA KUMAR Asst.Prof Sri Mittapalli Institute of Technology for Women
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Knowledge discovery from Data (KDD) is essential for data mining. While others view data
mining as an essential step in the process of knowledge discovery. Here is the list of steps
3. DataWarehouse
A datawarehouse is defined as the collection of data integrated from multiple
sources that will query and decision making.
There are three types of datawarehouse: Enterprise datawarehouse, Data
Mart and Virtual Warehouse.
Two approaches can be used to update data in DataWarehouse: Query-
driven Approach and Update-driven Approach.
Application: Business decision making, Data mining, etc.
4. Transactional Databases
Transactional databases are a collection of data organized by time stamps, date,
etc to represent transaction in databases.
This type of database has the capability to roll back or undo its operation when a
transaction is not completed or committed.
Highly flexible system where users can modify information without changing any
sensitive information.
Follows ACID property of DBMS.
Application: Banking, Distributed systems, Object databases, etc.
5. Multimedia Databases
Multimedia databases consists audio, video, images and text media.
They can be stored on Object-Oriented Databases.
They are used to store complex information in pre-specified formats.
Application: Digital libraries, video-on demand, news-on demand, musical
database, etc.
6. Spatial Database
Store geographical information.
Stores data in the form of coordinates, topology, lines, polygons, etc.
Application: Maps, Global positioning, etc.
7. Time-series Databases
Time series databases contain stock exchange data and user logged activities.
Handles array of numbers indexed by time, date, etc.
It requires real-time analysis.
Application: eXtremeDB, Graphite, InfluxDB, etc.
8. WWW
WWW refers to World wide web is a collection of documents and resources like
audio, video, text, etc which are identified by Uniform Resource Locators (URLs)
through web browsers, linked by HTML pages, and accessible via the Internet
network.
It is the most heterogeneous repository as it collects data from multiple resources.
It is dynamic in nature as Volume of data is continuously increasing and changing.
Application: Online shopping, Job search, Research, studying, etc.
a) Descriptive Function
The descriptive function deals with the general properties of data in the database.
1. Class/Concept Description
2. Mining of Frequent Patterns
3. Mining of Associations
4. Mining of Correlations
5. Mining of Clusters
1. Class/Concept Description
Class/Concept refers to the data to be associated with the classes or concepts. For
example, in a company, the classes of items for sales include computer and printers, and
concepts of customers include big spenders and budget spenders. Such descriptions of a class
or a concept are called class/concept descriptions. These descriptions can be derived by the
fol
Data Characterization
class under study is called as Target Class.
Data Discrimination
predefined group or class.
4. Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical
correlations between associated-attribute-value pairs or between two item sets to analyze
that if they have positive, negative or no effect on each other.
5. Mining of Clusters
Cluster refers to a group of similar kind of objects. Cluster analysis refers to forming
group of objects that are very similar to each other but are highly different from the objects
in other clusters.
2. Prediction
than class labels. Regression Analysis is generally used for prediction. Prediction can
also be used for identification of distribution trends based on available data.
3. Decision Trees
and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes
the outcome of a test, and each leaf node holds a class label.
5. Neural Networks
processing. These models are biologically inspired rather than an exact replica of how
the brain actually functions. Neural networks have been shown to be very promising
systems in many forecasting applications and business classification applications due
learn data.
6. Outlier Analysis
with the general behavior or model of the data available.
7. Evolution Analysis
regularities or trends for objects whose behavior changes over time.
Note
1. Statistics:
It uses the mathematical analysis to express representations, model and summarize
empirical data or real world observations.
Statistical analysis involves the collection of methods, applicable to large amount of
data to conclude and report the trend.
2. Machine learning
Arthur Samuel defined machine learning as a field of study that gives computers the
ability to learn without being programmed.
When the new data is entered in the computer, algorithms help the data to grow or
change due to machine learning.
In machine learning, an algorithm is constructed to predict the data from the available
database (Predictive analysis).
It is related to computational statistics.
Intrusion Detection
Intrusion refers to any kind of action that threatens integrity, confidentiality, or the
availability of network resources. In this world of connectivity, security has become the
major issue. With increased usage of internet and availability of the tools and tricks for
intruding and attacking network prompted intrusion detection to become a critical component
of network administration.
Major Issues in data mining:
Data mining is a dynamic and fast-expanding field with great strengths. The major issues
can divided into five groups:
a) Mining Methodology
b) User Interaction
c) Efficiency and scalability
d) Diverse Data Types Issues
e) Data mining society
a) MiningMethodology:
attributes.
Example:
Attribute Values
Colors Black, Green, Brown, red