Unit II - DW&DM
The Multidimensional Data Model is a method for organizing data in a database so that its contents are well arranged and easy to assemble.
Unlike relational databases, which let users access data in the form of queries, the Multidimensional Data Model lets users pose analytical questions related to market or business trends. It allows users to receive answers to their requests rapidly, because the data can be created and examined comparatively fast.
OLAP (online analytical processing) and data warehousing use multidimensional databases, which are used to present multiple dimensions of the data to users.
The model represents data in the form of data cubes. Data cubes allow the data to be modeled and viewed from many dimensions and perspectives. A cube is defined by dimensions and facts and is represented by a fact table. Facts are numerical measures, and fact tables contain the measures of the related dimension tables or the names of the facts.
The Multidimensional Data Model works on the basis of pre-decided steps.
Every project for building a Multidimensional Data Model should follow the stages below:
Stage 1 : Assembling data from the client : In the first stage, a Multidimensional Data Model collects the correct data from the client. Mostly, software professionals make clear to the client the range of data that can be obtained with the selected technology, and then collect the complete data in detail.
Stage 2 : Grouping different segments of the system : In the second stage, the Multidimensional Data Model recognizes and classifies all the data into the respective sections they belong to, and also makes the model problem-free to apply step by step.
Stage 3 : Noticing the different proportions : The third stage is the basis on which the design of the system rests. In this stage, the main factors are recognized from the user's point of view. These factors are also known as "dimensions".
Stage 4 : Preparing the actual-time factors and their respective qualities : In the fourth stage, the factors recognized in the previous step are used to identify their related qualities. These qualities are also known as "attributes" in the database.
Stage 5 : Finding the actuality of the factors listed previously and their qualities : In the fifth stage, the Multidimensional Data Model separates and differentiates the facts from the factors it has collected. These facts play a significant role in the arrangement of a Multidimensional Data Model.
Stage 6 : Building the schema to place the data, with respect to the information collected in the steps above : In the sixth stage, a schema is built on the basis of the data collected previously.
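As a rough illustration of Stage 6, here is a minimal star-schema sketch in Python with SQLite; the table and column names (dim_item, dim_time, fact_sales, and their columns) are assumptions made for this example, not taken from the text above.

```python
import sqlite3

# Minimal star-schema sketch: one fact table referencing two dimension tables.
# All table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_item (
    item_key  INTEGER PRIMARY KEY,
    item_name TEXT,
    item_type TEXT
);

CREATE TABLE dim_time (
    time_key  INTEGER PRIMARY KEY,
    quarter   TEXT,
    year      INTEGER
);

CREATE TABLE fact_sales (
    item_key          INTEGER REFERENCES dim_item(item_key),
    time_key          INTEGER REFERENCES dim_time(time_key),
    sales_rs_thousand REAL      -- the numeric measure (fact)
);
""")
conn.commit()
```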
For Example :
1. Let us take the example of a firm. The revenue cost of a firm can be recognized on the basis of different factors such as the geographical location of the firm's workplace, the products of the firm, the advertisements done, the time utilized to develop a product, etc.
2. Let us take the example of the data of a factory which sells products per quarter in Bangalore. The data
is represented in the table given below :
2D factory data
In the presentation above, the factory's sales for Bangalore are shown with respect to the time dimension, which is organized into quarters, and the item dimension, which is organized according to the kind of item sold. The facts here are represented in rupees (in thousands).
Now, if we wish to view the sales data in a three-dimensional table, it is represented in the table given below. Here the sales data is still laid out as a two-dimensional table, but the data is considered according to item, time and location (such as Kolkata, Delhi, Mumbai). Here is the table:
3D data representation as 2D
Conceptually, this data can be represented in the form of three dimensions, as shown in the image below:
3D data representation
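A minimal sketch of such a cube using pandas is given below; the items, quarters, cities, and sales figures are assumed for illustration and are not the values from the original tables.

```python
import pandas as pd

# Illustrative sales records (values in thousands of rupees; figures are assumed).
sales = pd.DataFrame({
    "item":     ["Keyboard", "Keyboard", "Mouse", "Mouse", "Keyboard", "Mouse"],
    "quarter":  ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "location": ["Bangalore", "Bangalore", "Bangalore", "Delhi", "Delhi", "Mumbai"],
    "sales":    [605, 680, 825, 952, 512, 740],
})

# View the cube as item x (location, quarter): one 2-D slab per city.
cube = sales.pivot_table(index="item", columns=["location", "quarter"],
                         values="sales", aggfunc="sum", fill_value=0)
print(cube)

# Slicing the cube on location = "Bangalore" gives back the 2-D factory data.
bangalore = sales[sales["location"] == "Bangalore"]
print(bangalore.pivot_table(index="item", columns="quarter", values="sales"))
```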
1. Requirements analysis and capacity planning: The first process in data warehousing involves defining enterprise needs, defining architectures, carrying out capacity planning, and selecting the hardware and software tools. This step involves consulting senior management as well as the different stakeholders.
2. Hardware integration: Once the hardware and software have been selected, they need to be put together by integrating the servers, the storage methods, and the user software tools.
4. Physical modeling: For the data warehouse to perform efficiently, physical modeling is needed. This involves designing the physical data warehouse organization, data placement, data partitioning, deciding on access techniques, and indexing.
5. Sources: The information for the data warehouse is likely to come from several data sources. This step involves identifying and connecting the sources using gateways, ODBC drivers, or other wrappers.
6. ETL: The data from the source systems will need to go through an ETL phase. The process of designing and implementing the ETL phase may involve identifying suitable ETL tool vendors and purchasing and implementing the tools. It may also involve customizing the tools to suit the needs of the enterprise (a minimal ETL sketch is given after this list).
7. Populate the data warehouse: Once the ETL tools have been agreed upon, testing the tools will be needed, perhaps using a staging area. Once everything is working adequately, the ETL tools may be used to populate the warehouse given the schema and view definitions.
8. User applications: For the data warehouse to be helpful, there must be end-user applications. This step involves designing and implementing the applications required by the end-users.
9. Roll-out the warehouse and applications: Once the data warehouse has been populated and the end-client applications tested, the warehouse system and the applications may be rolled out for the user community to use.
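As referenced in the ETL step above, the following is a minimal, hedged ETL sketch in Python; the source file name, column names, and cleaning rules are assumptions for illustration only.

```python
import csv
import sqlite3

# Minimal ETL sketch: extract rows from a CSV source, transform (clean) them,
# and load them into a warehouse table. File and column names are assumed.
def run_etl(source_csv="sales_source.csv", warehouse_db="warehouse.db"):
    conn = sqlite3.connect(warehouse_db)
    conn.execute("""CREATE TABLE IF NOT EXISTS sales_fact
                    (item TEXT, quarter TEXT, amount REAL)""")

    with open(source_csv, newline="") as f:
        for row in csv.DictReader(f):
            # Transform: skip incomplete rows and normalise the item name.
            if not row.get("item") or not row.get("amount"):
                continue
            item = row["item"].strip().title()
            amount = float(row["amount"])
            # Load: insert the cleaned record into the warehouse table.
            conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?)",
                         (item, row.get("quarter", "").strip(), amount))

    conn.commit()
    conn.close()
```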
Implementation Guidelines
2. Need a champion: A data warehouse project must have a champion who is willing to carry out considerable research into the expected costs and benefits of the project. Data warehousing projects require inputs from many units in an enterprise and therefore need to be driven by someone who is capable of interacting with people across the enterprise and can actively persuade colleagues.
4. Ensure quality: Only data that has been cleaned and is of a quality understood by the organization should be loaded into the data warehouse.
5. Corporate strategy: A data warehouse project must fit with the corporate strategy and business objectives. The purpose of the project must be defined before the project begins.
6. Business plan: The financial costs (hardware, software, and peopleware), expected benefits, and a project plan for a data warehouse project must be clearly outlined and understood by all stakeholders. Without such understanding, rumors about expenditure and benefits can become the only source of information, subverting the project.
7. Training: Data warehouse projects must not overlook training requirements. For a data warehouse project to be successful, the users must be trained to use the warehouse and to understand its capabilities.
8. Adaptability: The project should build in flexibility so that changes may be made to the data warehouse if and when required. Like any system, a data warehouse will need to change as the needs of the enterprise change.
9. Joint management: The project must be handled by both IT and business professionals in the enterprise. To ensure proper communication with the stakeholders and that the project is targeted at assisting the enterprise's business, business professionals must be involved in the project along with technical professionals.
Financial services
Banking services
Consumer goods
Retail sectors
Controlled manufacturing
Types of Data Warehouse
Information processing, analytical processing, and data mining are the three types of data warehouse
applications that are discussed below −
Information Processing − A data warehouse allows the data stored in it to be processed. The data can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.
Analytical Processing − A data warehouse supports analytical processing of the information
stored in it. The data can be analyzed by means of basic OLAP operations, including slice-
and-dice, drill down, drill up, and pivoting.
Data Mining − Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. These mining results can be presented using visualization tools.
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
By climbing up a concept hierarchy for a dimension
By dimension reduction
The following diagram illustrates how roll-up works.
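In addition, a small, hedged sketch of roll-up by climbing a concept hierarchy (city to country) is given below; the sales figures and the city-to-country mapping are assumed for illustration.

```python
import pandas as pd

# Sales by city (illustrative figures).
sales = pd.DataFrame({
    "city":  ["Kolkata", "Delhi", "Mumbai", "Bangalore"],
    "sales": [1200, 980, 1500, 1100],
})

# Concept hierarchy: city -> country (assumed mapping).
city_to_country = {"Kolkata": "India", "Delhi": "India",
                   "Mumbai": "India", "Bangalore": "India"}

# Roll-up: climb the location hierarchy from city to country and aggregate,
# reducing the four city rows to a single country row.
sales["country"] = sales["city"].map(city_to_country)
rolled_up = sales.groupby("country", as_index=False)["sales"].sum()
print(rolled_up)
```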
What is OLAM?
OLAM stands for Online analytical mining. It is also known as OLAP Mining. It integrates online
analytical processing with data mining and mining knowledge in multi-dimensional databases. There are
several paradigms and structures of data mining systems.
Various data mining tools must work on integrated, consistent, and cleaned data. This requires costly pre-
processing for data cleaning, data transformation, and data integration. Thus, a data warehouse constructed
by such pre-processing is a valuable source of high-quality information for both OLAP and data mining.
Data mining can serve as a valuable tool for data cleaning and data integration.
OLAM is particularly important for the following reasons −
High quality of data in data warehouses − Most data mining tools are required to work on integrated,
consistent, and cleaned information, which needs costly data cleaning, data integration, and data
transformation as a pre-processing phase. A data warehouse constructed by such pre-processing serves as a
valuable source of high-quality data for OLAP and data mining. Data mining can also serve as a valuable
tool for data cleaning and data integration.
Available information processing infrastructure surrounding data warehouses − Comprehensive data processing and data analysis infrastructures have been or will be systematically constructed surrounding data warehouses; these include the accessing, integration, consolidation, and transformation of multiple heterogeneous databases, ODBC/OLE DB connections, Web-accessing and service facilities, and reporting and OLAP analysis tools. It is prudent to make the best use of the available infrastructure rather than constructing everything from scratch.
OLAP-based exploratory data analysis − Effective data mining requires exploratory data analysis. A user will often want to traverse through a database, select portions of relevant data, analyze them at multiple granularities, and present knowledge/results in multiple forms.
Online analytical mining provides facilities for data mining on multiple subsets of data and at several levels of abstraction, by drilling, pivoting, filtering, dicing, and slicing on a data cube and on some intermediate data mining results.
On-line selection of data mining functions − Users may not always know the specific types of knowledge they want to mine. By integrating OLAP with multiple data mining functions, online analytical mining provides users with the flexibility to choose the desired data mining functions and swap data mining tasks dynamically.
On Line Transaction Processing (OLTP) System in DBMS
An On-Line Transaction Processing (OLTP) system is a system that manages transaction-oriented applications. These systems are designed to support on-line transactions and to process queries quickly on the Internet.
For example, the POS (point of sale) system of any supermarket is an OLTP system.
Every industry in today's world uses OLTP systems to record transactional data. The main concern of OLTP systems is to enter, store, and retrieve data. They cover all day-to-day operations of an organization, such as purchasing, manufacturing, payroll, and accounting. Such systems have large numbers of users who conduct short transactions. They support simple database queries, so the response time to any user action is very fast.
The data acquired through an OLTP system is stored in a commercial RDBMS, which can be used by an OLAP system for data analytics and other business intelligence operations.
Some other examples of OLTP systems include order entry, retail sales, and financial transaction systems.
Advantages of an OLTP System:
OLTP systems are user-friendly and can be used by anyone with a basic understanding.
They allow users to read, write, and delete data quickly.
They respond to user actions immediately, as they can process queries very quickly.
These systems are the original source of the data.
They help administer and run fundamental business tasks.
They help widen an organization's customer base by simplifying individual processes.
Challenges of an OLTP system:
They allow multiple users to access and change the same data at the same time, so they require concurrency control and recovery mechanisms to avoid unexpected situations.
The data acquired through OLTP systems is not suitable for decision making. OLAP systems are used for decision making or "what if" analysis.
Type of queries that an OLTP system can Process:
An OLTP system is an online database-modifying system, so it supports database operations such as INSERT, UPDATE, and DELETE on the information in the database. Consider the POS system of a supermarket; below are sample queries that it can process −
Retrieve the complete description of a particular product
Filter all products related to any particular supplier
Search for the record of any particular customer.
List all products having price less than Rs 1000.
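As a hedged sketch, the sample queries above might look like the following against an assumed product/customer schema; the table names, column names, and sample rows are illustrative, not taken from the text.

```python
import sqlite3

# Self-contained sketch of the OLTP queries listed above; the schema and
# sample rows are assumptions for illustration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT,
                      supplier_id INTEGER, price REAL, description TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
INSERT INTO product VALUES (101, 'Rice 5kg', 12, 450, 'Basmati rice, 5 kg bag');
INSERT INTO customer VALUES (5001, 'A. Kumar', '9999999999');
""")

# Retrieve the complete description of a particular product.
print(cur.execute("SELECT * FROM product WHERE product_id = ?", (101,)).fetchall())

# Filter all products related to a particular supplier.
print(cur.execute("SELECT * FROM product WHERE supplier_id = ?", (12,)).fetchall())

# Search for the record of a particular customer.
print(cur.execute("SELECT * FROM customer WHERE customer_id = ?", (5001,)).fetchall())

# List all products priced below Rs 1000.
print(cur.execute("SELECT * FROM product WHERE price < ?", (1000,)).fetchall())
```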
Advantages:
1. It can find useful information which is not visible in simple data browsing
2. It can find interesting associations and correlations among data items
Disadvantages:
Data mining deals with different types of patterns, and frequent pattern mining is one of them. The concept was introduced for mining transaction databases. Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear frequently in a database. Frequent pattern mining is an analytical process that finds frequent patterns, associations, or causal structures in databases. The process aims to find the items that occur frequently in transactions. Through frequent patterns we can identify items that are strongly correlated with one another, and we can identify similar characteristics and associations among them. After frequent pattern mining we can go further to clustering and association analysis.
Frequent pattern mining can be carried out using association rules with particular algorithms, such as the Eclat and Apriori algorithms. Frequent pattern mining searches for recurring relationships in a data set and also helps to find the inherent regularities in the data.
Association Rule Mining:
It is easy to derive association rules from frequent patterns:
for each frequent pattern X, and for each nonempty proper subset Y ⊂ X,
calculate the confidence of the rule Y → X − Y, i.e. support(X) / support(Y);
if it is greater than the minimum confidence threshold, keep the rule. Two algorithms support this lattice-based search:
1. Apriori algorithm
2. eclat algorithm
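A minimal, hedged sketch of the Apriori-style level-wise search and the rule-generation step described above is given below; the transactions and the support/confidence thresholds are assumed for illustration.

```python
from itertools import combinations

# Illustrative transactions; support and confidence thresholds are assumed.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
min_support = 0.4
min_confidence = 0.6

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Apriori-style level-wise search: candidates are grown only from itemsets
# that were frequent at the previous level.
items = sorted({i for t in transactions for i in t})
frequent = {frozenset([i]) for i in items if support({i}) >= min_support}
all_frequent = set(frequent)
while frequent:
    size = len(next(iter(frequent))) + 1
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == size}
    frequent = {c for c in candidates if support(c) >= min_support}
    all_frequent |= frequent

# Rule generation: for each frequent itemset X and nonempty proper subset Y,
# keep the rule Y -> X - Y if its confidence support(X)/support(Y) is high enough.
for x in all_frequent:
    for r in range(1, len(x)):
        for y in map(frozenset, combinations(x, r)):
            confidence = support(x) / support(y)
            if confidence >= min_confidence:
                print(set(y), "->", set(x - y), round(confidence, 2))
```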
Working principle (a simple point-of-sale application for any supermarket that has a good scale of products):
The product data is entered into the database.
The taxes and commissions are entered.
The product is purchased and is taken to the bill counter.
The billing operator scans the product with the bar-code machine, which matches the product in the database and then shows the product's information.
The bill is paid by the customer, and he receives the products.
Closed Pattern:
A closed pattern is a frequent pattern: it meets the minimum support criterion, and all super-patterns of a closed pattern are less frequent than the closed pattern.
Max Pattern:
A max pattern also meets the minimum support criterion (like a closed pattern), but all super-patterns of a max pattern are not frequent. Both kinds of patterns generate fewer patterns and therefore increase the efficiency of the task.
Frequent pattern mining is applied in areas such as basket data analysis, cross-marketing, catalog design, sales campaign analysis, web log analysis, and DNA sequence analysis.
Issues of frequent pattern mining
flexibility and reusability for creating frequent patterns
most of the algorithms used for mining frequent itemsets do not offer flexibility for reusing the computed patterns
much research is still needed to reduce the size of the derived patterns
Frequent pattern mining has several applications in different areas, including:
Market Basket Analysis: This is the process of analyzing customer purchasing patterns in order
to identify items that are frequently bought together. This information can be used to optimize
product placement, create targeted marketing campaigns, and make other business decisions.
Recommender Systems: Frequent pattern mining can be used to identify patterns in user
behavior and preferences in order to make personalized recommendations.
Fraud Detection: Frequent pattern mining can be used to identify abnormal patterns of behavior
that may indicate fraudulent activity.
Network Intrusion Detection: Network administrators can use frequent pattern mining to detect
patterns of network activity that may indicate a security threat.
Medical Analysis: Frequent pattern mining can be used to identify patterns in medical data that
may indicate a particular disease or condition.
Text Mining: Frequent pattern mining can be used to identify patterns in text data, such as
keywords or phrases that appear frequently together in a document.
Web usage mining: Frequent pattern mining can be used to analyze patterns of user behavior on
a website, such as which pages are visited most frequently or which links are clicked on most
often.
Gene Expression: Frequent pattern mining can be used to analyze patterns of gene expression
in order to identify potential biomarkers for different diseases.
These are a few examples of the application of frequent pattern mining. The list is not exhaustive and the
technique can be applied in many other areas, as well.
Correlation Analysis in Data Mining
Correlation analysis is a statistical method used to measure the strength of the linear relationship between
two variables and compute their association. Correlation analysis calculates the level of change in one
variable due to the change in the other. A high correlation points to a strong relationship between the two
variables, while a low correlation means that the variables are weakly related.
Researchers use correlation analysis to analyze quantitative data collected through research methods like
surveys and live polls for market research. They try to identify relationships, patterns, significant
connections, and trends between two variables or datasets. There is a positive correlation between two
variables when an increase in one variable leads to an increase in the other. On the other hand, a negative
correlation means that when one variable increases, the other decreases and vice-versa.
Correlation is a bivariate analysis that measures the strength of association between two variables and the
direction of the relationship. In terms of the strength of the relationship, the correlation coefficient's value
varies between +1 and -1. A value of ± 1 indicates a perfect degree of association between the two variables.
As the correlation coefficient value goes towards 0, the relationship between the two variables will be
weaker. The coefficient sign indicates the direction of the relationship; a + sign indicates a positive
relationship, and a - sign indicates a negative relationship.
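A minimal sketch of computing the correlation coefficient described above, using assumed sample data:

```python
import numpy as np

# Illustrative paired observations (assumed values).
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 3.9, 6.2, 7.8, 10.1])

# Pearson correlation coefficient: covariance divided by the product of the
# standard deviations; the value always lies between -1 and +1.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))   # close to +1 -> strong positive linear relationship
```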