
Unit - 4 Final

A data warehouse is a relational database designed for analysis rather than transactions. It contains historical data from multiple sources integrated into a single view. Data warehouses use a multidimensional model to organize data into facts and dimensions for analysis. Common operations on this model include roll-ups to aggregate data, drill-downs for finer detail, slicing to filter dimensions, dicing to filter multiple dimensions, and pivoting to change the data view. This multidimensional structure allows for flexible analysis of business trends over time.

Uploaded by

SHREYA M PSGRKCW
Copyright
© All Rights Reserved

Data Mining Techniques

AP20C12
UNIT IV:

DATA WAREHOUSING: Introduction: What is a data warehouse?-Definition.
Multidimensional Data model-OLAP Operations-Warehouse Schema- Data
Warehousing Architecture- Warehouse Server-Metadata- OLAP Engine- Data
Warehouse Backend Process. Other Features.
What is a Data Warehouse?
• A Data Warehouse (DW) is a relational database that is designed for query and
analysis rather than transaction processing. It includes historical data derived
from transaction data from single and multiple sources
• A Data Warehouse provides integrated, enterprise-wide, historical data and
focuses on providing support for decision-makers for data modeling and analysis
• A Data Warehouse is a group of data specific to the entire organization, not only
to a particular group of users
• It is not used for daily operations and transaction processing but used for
making decisions
• A Data Warehouse can be viewed as a data system with the following attributes:

• It is a database designed for investigative tasks, using data from various applications

• It supports a relatively small number of clients with relatively long interactions

• It includes current and historical data to provide a historical perspective of information

• Its usage is read-intensive

• It contains a few large tables

• A Data Warehouse is a subject-oriented, integrated, time-variant, and non-volatile store of
information in support of management's decisions
Characteristics and Definition of Data Warehouse:
Subject-Oriented:
• A data warehouse focuses on the modeling and analysis of data for decision-makers
• Therefore, data warehouses typically provide a concise and straightforward view around a
particular subject, such as customer, product, or sales, instead of the global organization's
ongoing operations
• This is done by excluding data that are not useful concerning the subject and including all data
needed by the users to understand the subject
Integrated:
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and online
transaction records
It requires performing data cleaning and integration during data warehousing to ensure
consistency in naming conventions, attributes types, etc., among different data sources
Time-Variant
• Historical information is kept in a data warehouse
• For example, one can retrieve data from 3 months, 6 months, or 12 months ago, or even older
data, from a data warehouse
• This contrasts with a transaction system, where often only the most current data is kept
Non-Volatile
• The data warehouse is a physically separate data storage, which is transformed from the
source operational RDBMS
• The operational updates of data do not occur in the data warehouse, i.e., update, insert, and
delete operations are not performed
• It usually requires only two procedures in data accessing: Initial loading of data and access to
data
• Therefore, the DW does not require transaction processing, recovery, and concurrency
capabilities, which allows for substantial speedup of data retrieval. Non-volatile means that
once data has entered the warehouse, it should not change
MultiDimensional Data Model
• The Multi Dimensional Data Model is a method for organizing the data in a database so that its
contents are well arranged and assembled
• The Multi Dimensional Data Model allows users to ask analytical questions about market or
business trends, unlike relational databases, which give users access to data through
queries
• It lets users receive answers to their requests rapidly, because the data can be assembled
and examined comparatively fast
• OLAP (online analytical processing) and data warehousing use multidimensional databases

• It is used to show multiple dimensions of the data to users

• It represents data in the form of data cubes

• Data cubes allow data to be modeled and viewed from many dimensions and perspectives

• A data cube is defined by dimensions and facts and is represented by a fact table

• Facts are numerical measures; a fact table contains the names of the facts (the measures) and
keys to each of the related dimension tables
Working on a Multidimensional Data Model
The Multidimensional Data Model is built by following a fixed sequence of steps
The following stages should be followed by every project for building a Multi Dimensional Data
Model : 
Stage 1 : Assembling data from the client : In the first stage, correct and complete data is
collected from the client. Typically, software professionals explain to the client the range of
data that can be obtained with the selected technology, and then collect the complete data in
detail
Stage 2 : Grouping different segments of the system : In the second stage, all the data is
identified and classified into the sections it belongs to, so that the model can be built step by
step without problems
Stage 3 : Identifying the dimensions : The third stage is the basis on which the design of the
system rests. In this stage, the main factors are identified from the user's point of view. These
factors are known as "dimensions"
Stage 4 : Preparing the factors and their respective qualities : In the fourth stage, the factors
identified in the previous stage are used to identify their related qualities. These qualities are
known as "attributes" in the database
Stage 5 : Finding the facts for the factors listed previously and their qualities : In the fifth
stage, the facts are separated from the factors (dimensions) collected so far. These facts play a
significant role in the arrangement of a Multi Dimensional Data Model
Stage 6 : Building the schema to hold the data, based on the information collected in the steps
above : In the sixth stage, a schema is built on the basis of the data collected previously
Data Cube:
• A grouping of data in a multidimensional matrix is called a data cube
• In data warehousing, we generally deal with multidimensional data models, as the
data is described by multiple dimensions and multiple attributes
• This multidimensional data is represented as a data cube because the cube represents a
high-dimensional space
• The Data cube pictorially shows how different attributes of data are arranged in the data model.
Below is the diagram of a general data cube

The example above is a 3D cube having attributes like branch (A, B, C, D), item type (home,
entertainment, computer, phone, security), and year (1997, 1998, 1999)
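As a rough illustration of the idea, the following Python sketch builds a small cube whose cells are keyed by (branch, item type, year). The sales amounts and the `build_cube` helper are made up for this example; only the dimension names come from the cube described above.

```python
from collections import defaultdict

# Hypothetical sales records: (branch, item_type, year, amount).
# The dimension values mirror the example cube; the amounts are invented.
records = [
    ("A", "home", 1997, 120),
    ("A", "computer", 1997, 300),
    ("A", "home", 1998, 150),
    ("B", "phone", 1998, 80),
    ("B", "home", 1997, 60),
    ("A", "home", 1997, 30),  # a second sale that falls into an existing cell
]


def build_cube(rows):
    """Sum the measure into one cell per (branch, item_type, year)."""
    cube = defaultdict(int)
    for branch, item_type, year, amount in rows:
        cube[(branch, item_type, year)] += amount
    return dict(cube)


cube = build_cube(records)
print(cube[("A", "home", 1997)])  # both sales for this cell, added together
```

Each distinct combination of dimension values becomes one cell, and the fact (the amount) is the value stored in that cell.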
Data cube classification:
• The data cube can be classified into two categories:
Multidimensional data cube: It stores large amounts of data in a multi-dimensional array
• It gains efficiency by keeping an index on each dimension
• Thus, it is able to retrieve data fast
Relational data cube: It stores large amounts of data in relational tables
• Each relational table represents a dimension of the data cube
• It is slower than a Multidimensional Data Cube
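The difference between the two categories can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the "relational" cube is a plain list of rows rather than a real database table, and all names (`index`, `array_cube`, `relational_cube`) are hypothetical.

```python
branches = ["A", "B"]
item_types = ["home", "computer"]
years = [1997, 1998]

# One index per dimension: member -> position in the array.
index = [
    {member: pos for pos, member in enumerate(branches)},
    {member: pos for pos, member in enumerate(item_types)},
    {member: pos for pos, member in enumerate(years)},
]

# Multidimensional cube: a dense 3-D array addressed by position.
array_cube = [[[0 for _ in years] for _ in item_types] for _ in branches]

# Relational cube: the same cells kept as rows of a table.
relational_cube = []


def put(branch, item_type, year, amount):
    b, i, y = index[0][branch], index[1][item_type], index[2][year]
    array_cube[b][i][y] += amount
    relational_cube.append(
        {"branch": branch, "item_type": item_type, "year": year, "amount": amount}
    )


def get_from_array(branch, item_type, year):
    """Direct lookup: three index probes, then one array access."""
    b, i, y = index[0][branch], index[1][item_type], index[2][year]
    return array_cube[b][i][y]


def get_from_rows(branch, item_type, year):
    """Row lookup: scans every row and sums the matching ones."""
    return sum(
        row["amount"]
        for row in relational_cube
        if (row["branch"], row["item_type"], row["year"]) == (branch, item_type, year)
    )


put("A", "home", 1997, 120)
put("B", "computer", 1998, 300)
```

Both lookups return the same value; the array version reaches the cell directly through the per-dimension indexes, while the row version has to search, which is one reason the relational form tends to be slower.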
Data Cube Operations:

Data cube operations are used to manipulate data to meet the needs of users
These operations help to select particular data for the analysis purpose
Roll-up: this operation aggregates similar data attributes along the same dimension

• For example, if the data cube displays the daily income of a customer, we can use a roll-up
operation to find the customer's monthly income

Drill-down: this operation is the reverse of the roll-up operation

• It allows us to take particular information and then subdivide it further for finer
granularity analysis

• It zooms into more detail

• For example, if India is a value in the country column and we wish to see villages in
India, then the drill-down operation splits India into states, districts, towns, cities, and
villages and then displays the required information
Slicing: This operation filters out the unnecessary portions
• Suppose that in a particular dimension the user doesn't need everything for analysis, but only
a particular attribute value
• For example, country="jamaica" will display data only about Jamaica and will not display the
other countries present in the country list.
Dicing: This operation does a multidimensional cut: it cuts not just one dimension but can also
go to another dimension and cut a certain range of it
• For example, the user wants to see the annual salary of Jharkhand state employees
Pivot: This operation changes how the data is viewed
• It rotates the data cube to present a different view
• It doesn't change the data present in the data cube
• For example, if the user is comparing year versus branch, using the pivot operation, the user
can change the viewpoint and now compare branch versus item type
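The operations above can be sketched on a small in-memory cube. This is a minimal illustration, not a real OLAP engine: the cell values and the function names (`roll_up`, `slice_`, `dice`, `pivot`) are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (branch, item_type, year); values are invented.
cube = {
    ("A", "home", 1997): 100,
    ("A", "home", 1998): 110,
    ("A", "phone", 1997): 40,
    ("B", "home", 1997): 70,
    ("B", "phone", 1998): 90,
}
DIMS = ("branch", "item_type", "year")


def roll_up(cube, keep):
    """Aggregate away every dimension that is not listed in `keep`."""
    positions = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[p] for p in positions)] += value
    return dict(out)


def slice_(cube, dim, member):
    """Fix one dimension to a single member."""
    p = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[p] == member}


def dice(cube, **allowed):
    """Keep cells whose members fall in the allowed set of each named dimension."""
    return {
        k: v
        for k, v in cube.items()
        if all(k[DIMS.index(d)] in members for d, members in allowed.items())
    }


def pivot(cube, order):
    """Reorder the dimensions of every key; the stored values do not change."""
    positions = [DIMS.index(d) for d in order]
    return {tuple(k[p] for p in positions): v for k, v in cube.items()}


by_branch = roll_up(cube, ["branch"])                 # coarse view
by_branch_year = roll_up(cube, ["branch", "year"])    # drill-down: one more dimension
only_1997 = slice_(cube, "year", 1997)
a_in_1997 = dice(cube, branch={"A"}, year={1997})
year_first = pivot(cube, ["year", "branch", "item_type"])
```

Drill-down is shown here as the reverse of roll-up: moving from `by_branch` to `by_branch_year` reintroduces a dimension that the coarser view had aggregated away.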
Advantages of data cubes:

• Helps in giving a summarized view of data

• Data cubes store large data in a simple way

• Data cube operation provides quick and better analysis

• Improves query performance on the data


For Example : 
1. Let us take the example of a firm
• The revenue of a firm can be analyzed on the basis of different factors such as the
geographical location of the firm's workplace, the products of the firm, the advertisements
done, the time taken to develop a product, etc.
2. Let us take the example of the data of a factory which sells products per quarter in Bangalore
• The data is represented in the table given below :

• In the presentation above, the factory's sales for Bangalore are shown along the time
dimension, which is organized into quarters, and the item dimension, which is sorted according
to the kind of item sold
• The facts here are represented in rupees (in thousands)
• Here the sales data is represented as a two-dimensional table
• Now suppose we want to view the sales data in three dimensions: item, time, and location
(like Kolkata, Delhi, Mumbai).
Here is the table :

This data can be represented in the form of three dimensions conceptually, which is shown
in the image below :
Data Warehouse – Schemas :
• A schema is defined as a logical description of a database in which fact and dimension tables
are joined in a logical manner
• A Data Warehouse is maintained in the form of a Star, Snowflake, or Fact Constellation schema

Star Schema
• A Star schema contains a fact table and multiple dimension tables
• Each dimension is represented by only one dimension table, and these tables are not normalized
• Each dimension table contains a set of attributes
Characteristics

• In a Star schema, there is only one fact table and multiple dimension tables

• In a Star schema, each dimension is represented by one dimension table

• Dimension tables are not normalized in a Star schema

• Each Dimension table is joined to a key in a fact table


The following illustration shows the sales data of a company with respect to the four dimensions,
namely Time, Item, Branch, and Location

• There is a fact table at the center. It contains the keys to each of the four dimensions
• The fact table also contains the measures, namely dollars sold and units sold
• Note − Each dimension has only one dimension table and each table holds a set of attributes.
For example, the location dimension table contains the attribute set {location_key, street, city,
province_or_state, country}. Because the table is not normalized, this constraint may cause
data redundancy.
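A star schema of this shape can be sketched with Python's built-in `sqlite3` module. The measures (`dollars_sold`, `units_sold`) and the location attributes follow the text; the columns of the time, item, and branch tables, the table names, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT, brand TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);

-- The fact table sits at the center: one key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);

INSERT INTO time_dim VALUES (1, 'Q1', 1997), (2, 'Q2', 1997);
INSERT INTO item_dim VALUES (1, 'TV', 'entertainment', 'BrandX');
INSERT INTO branch_dim VALUES (1, 'A');
INSERT INTO location_dim VALUES (1, '1 Main St', 'Delhi', 'Delhi', 'India');
INSERT INTO sales_fact VALUES (1, 1, 1, 1, 500.0, 5), (2, 1, 1, 1, 300.0, 3);
""")

# A typical analytical query: join the fact table to the dimensions it needs
# and aggregate the measures.
rows = conn.execute("""
    SELECT l.city, t.year, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact f
    JOIN time_dim t     ON t.time_key = f.time_key
    JOIN location_dim l ON l.location_key = f.location_key
    GROUP BY l.city, t.year
""").fetchall()
print(rows)
```

Every dimension is reachable from the fact table in a single join, which is what gives the schema its star shape.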
Snowflake Schema
• Some dimension tables in the Snowflake schema are normalized
• The normalization splits up the data into additional tables as shown in the following illustration
• Unlike in the Star schema, the dimension tables in a Snowflake schema are normalized
• For example − The item dimension table of the star schema is normalized and split into two
dimension tables, namely the item and supplier tables
• Now the item dimension table contains the attributes item_key, item_name, type, brand, and
supplier_key
• The supplier key is linked to the supplier dimension table
• The supplier dimension table contains the attributes supplier_key and supplier_type
• Note − Due to the normalization in the Snowflake schema, redundancy is reduced; therefore,
the schema becomes easier to maintain and saves storage space
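The item and supplier split described above can be sketched with `sqlite3`. The attribute names come from the text; the table names and sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Supplier details are stored once, in their own table.
CREATE TABLE supplier_dim (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);

-- The item table keeps only a key that points at the supplier.
CREATE TABLE item_dim (
    item_key     INTEGER PRIMARY KEY,
    item_name    TEXT,
    type         TEXT,
    brand        TEXT,
    supplier_key INTEGER REFERENCES supplier_dim(supplier_key)
);

INSERT INTO supplier_dim VALUES (10, 'wholesale');
INSERT INTO item_dim VALUES (1, 'TV', 'entertainment', 'BrandX', 10),
                            (2, 'Radio', 'entertainment', 'BrandY', 10);
""")

# Reaching the supplier attributes now costs one extra join.
rows = conn.execute("""
    SELECT i.item_name, s.supplier_type
    FROM item_dim i
    JOIN supplier_dim s ON s.supplier_key = i.supplier_key
    ORDER BY i.item_key
""").fetchall()
print(rows)
```

The supplier type is written once even though two items share it, which is the redundancy saving; the price is the additional join at query time.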
Fact Constellation Schema (Galaxy Schema)
• A fact constellation has multiple fact tables
• It is also known as a Galaxy Schema
• The following illustration shows two fact tables, namely Sales and Shipping
• The sales fact table is the same as that in the Star Schema
• The shipping fact table has five dimension keys, namely item_key, time_key, shipper_key,
from_location, to_location
• The shipping fact table also contains two measures, namely dollars sold and units sold
• It is also possible to share dimension tables between fact tables
• For example − Time, item, and location dimension tables are shared between the sales and
shipping fact tables
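A reduced sketch of a fact constellation, again using `sqlite3`, shows two fact tables sharing one dimension table. The schema is deliberately cut down, and the measure name `units_shipped`, the table names, and the sample rows are assumptions for this example rather than the contents of the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);

-- Two fact tables that both reference the shared item dimension.
CREATE TABLE sales_fact (
    time_key INTEGER, item_key INTEGER REFERENCES item_dim(item_key), units_sold INTEGER
);
CREATE TABLE shipping_fact (
    time_key INTEGER, item_key INTEGER REFERENCES item_dim(item_key),
    shipper_key INTEGER, from_location INTEGER, to_location INTEGER,
    units_shipped INTEGER  -- hypothetical measure name
);

INSERT INTO item_dim VALUES (1, 'TV');
INSERT INTO sales_fact VALUES (1, 1, 5), (2, 1, 3);
INSERT INTO shipping_fact VALUES (1, 1, 7, 100, 200, 6);
""")

# Each fact table is aggregated separately, then lined up on the shared dimension.
# Joining both fact tables in one pass would multiply rows and inflate the sums.
rows = conn.execute("""
    SELECT i.item_name,
           (SELECT SUM(s.units_sold)     FROM sales_fact s     WHERE s.item_key  = i.item_key),
           (SELECT SUM(sh.units_shipped) FROM shipping_fact sh WHERE sh.item_key = i.item_key)
    FROM item_dim i
""").fetchall()
print(rows)
```

Because both fact tables use the same `item_key`, sales and shipping figures can be compared item by item without duplicating the item table.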
Data Warehouse Architecture:
• A data warehouse architecture is a method of defining the overall architecture of data
communication, processing, and presentation that exists for end-client computing within the
enterprise
• Each data warehouse is different, but all are characterized by standard vital components
• Production applications such as payroll, accounts payable, product purchasing, and inventory
control are designed for online transaction processing (OLTP)
• Such applications gather detailed data from day-to-day operations
• Data Warehouse applications are designed to support the user's ad-hoc data
requirements, an activity dubbed online analytical processing (OLAP)

• These include applications such as forecasting, profiling, summary reporting,
and trend analysis

• A data warehouse is a collection of data from different heterogeneous sources
organized under a unified schema

• There are 2 approaches for constructing data-warehouse:

• Top-down approach

• Bottom-up approach
Top-Down Approach: 
The essential components are discussed below: 
1) External Sources –  An external source is a source from which data is collected, irrespective
of the type of data
• Data can be structured, semi-structured, or unstructured
2) Stage Area –   Since the data extracted from the external sources does not follow a particular
format, it needs to be validated before it is loaded into the data warehouse
• For this purpose, it is recommended to use an ETL tool
• E (Extract): Data is extracted from the external data source
• T (Transform): Data is transformed into the standard format
• L (Load): Data is loaded into the data warehouse after transforming it into the standard format
3) Data Warehouse –   After cleansing, the data is stored in the data warehouse, which acts as
the central repository
• It stores the metadata together with the cleansed data, while department-specific subsets are
held in the data marts
• Note that the data warehouse stores the data in its purest form in this top-down approach 
4) Data Marts –  A data mart is also a part of the storage component
• It stores the information of a particular function of an organisation, which is handled by a
single authority
• There can be as many data marts in an organisation as there are functions
• We can also say that a data mart contains a subset of the data stored in the data warehouse 
5) Data Mining –  The practice of analyzing the big data present in the data warehouse is data
mining
• It is used to find the hidden patterns that are present in the database or in the data warehouse
with the help of data mining algorithms
• This approach is defined by Inmon: the data warehouse is a central repository for the
complete organisation, and data marts are created from it after the complete data warehouse
has been created
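The extract, transform, and load steps of the stage area (item 2 above) can be sketched as follows. This is a toy pipeline under stated assumptions: the two in-memory "sources", their field names, and the `sales` table are hypothetical, and a real ETL tool would add validation and error handling.

```python
import sqlite3

# Hypothetical raw records from two sources with inconsistent formats.
source_a = [{"cust": " Alice ", "amount": "100.50", "date": "2023-01-05"}]
source_b = [{"customer_name": "BOB", "amt": 75, "day": "05/02/2023"}]  # DD/MM/YYYY


def extract():
    """E: pull raw rows from every source, tagging where each row came from."""
    return [("a", r) for r in source_a] + [("b", r) for r in source_b]


def transform(tagged_rows):
    """T: map every row to one standard format (name, amount, ISO date)."""
    out = []
    for source, r in tagged_rows:
        if source == "a":
            name, amount, date = r["cust"], r["amount"], r["date"]
        else:
            name, amount = r["customer_name"], r["amt"]
            day, month, year = r["day"].split("/")
            date = f"{year}-{month}-{day}"
        out.append((name.strip().title(), float(amount), date))
    return out


def load(conn, rows):
    """L: write the standardized rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, sale_date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
loaded = conn.execute("SELECT * FROM sales ORDER BY customer").fetchall()
print(loaded)
```

The transform step is where naming conventions, types, and date formats from different sources are reconciled, so that everything reaching the warehouse follows one format.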
Advantages of Top-Down Approach –   
• Since the data marts are created from the data warehouse, this approach provides a consistent
dimensional view of the data marts
• Also, this model is considered the strongest model for handling business changes
• That is why big organizations prefer to follow this approach
• Creating a data mart from the data warehouse is easy
Disadvantages of Top-Down Approach –  
• The cost and time taken in designing and maintaining it are very high
 
 Bottom-Up Approach: 
First, the data is extracted from external sources (the same as in the top-down approach)
Then, the data goes through the staging area (as explained above) and is loaded into data marts
instead of the data warehouse
• The data marts are created first and provide reporting capability. Each data mart addresses a
single business area
These data marts are then integrated into the data warehouse 
• This approach is given by Kimball: data marts are created first and provide a thin view
for analysis, and the data warehouse is created after the complete data marts have been created
Advantages of Bottom-Up Approach:
1. As the data marts are created first, reports are generated quickly
2. We can accommodate more data marts here, and in this way the data warehouse can
be extended
3. Also, the cost and time taken in designing this model are comparatively low
Disadvantage of Bottom-Up Approach:
1. This model is not as strong as the top-down approach, as the dimensional view of the data
marts is not as consistent as it is in the top-down approach
Metadata :
• Metadata is simply defined as data about data

• The data that is used to represent other data is known as metadata

• For example, the index of a book serves as a metadata for the contents in the book

• In other words, we can say that metadata is the summarized data that leads us to detailed data

In terms of data warehouse, we can define metadata as follows.

• Metadata is the road-map to a data warehouse

• Metadata in a data warehouse defines the warehouse objects and it acts as a directory

• This directory helps the decision support system to locate the contents of a data warehouse
Categories of Metadata
Metadata can be broadly categorized into three categories −
Business Metadata − It has the data ownership information, business definition, and changing
policies
Technical Metadata − It includes database system names, table and column names and sizes, data
types and allowed values. Technical metadata also includes structural information such as
primary and foreign key attributes and indices
Operational Metadata − It includes currency of data and data lineage. Currency of data means
whether the data is active, archived, or purged. Lineage of data means the history of data migrated
and transformation applied on it
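Technical metadata of the kind listed above (column names, data types, key attributes) can be read from a database's own catalog. The sketch below uses SQLite's `PRAGMA table_info`; the `item_dim` table and the `technical_metadata` helper are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT NOT NULL, brand TEXT)"
)


def technical_metadata(conn, table):
    """Return column name, declared type, and primary-key flag for one table."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    # PRAGMA statements cannot take bound parameters, so `table` must be a trusted name.
    return [
        {"column": name, "type": col_type, "primary_key": bool(pk)}
        for _cid, name, col_type, _notnull, _default, pk in conn.execute(
            f"PRAGMA table_info({table})"
        )
    ]


meta = technical_metadata(conn, "item_dim")
for entry in meta:
    print(entry)
```

Business and operational metadata, such as ownership or lineage, are not stored in the catalog and would have to be recorded separately.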
Role of Metadata
• Metadata has a very important role in a data warehouse
• The role of metadata in a warehouse is different from the warehouse data, yet it plays an
important role
The various roles of metadata are explained below.
• Metadata acts as a directory
• This directory helps the decision support system to locate the contents of the data warehouse
• Metadata helps in decision support system for mapping of data when data is transformed from
operational environment to data warehouse environment
• Metadata helps in summarization between current detailed data and highly summarized data
• Metadata also helps in summarization between lightly detailed data and highly summarized
data
• Metadata is used for query tools
• Metadata is used in extraction and cleansing tools
• Metadata is used in reporting tools
• Metadata is used in transformation tools
• Metadata plays an important role in loading functions
The following diagram shows the roles of metadata
Metadata Repository
A metadata repository is an integral part of a data warehouse system. It has the following metadata:
Definition of data warehouse − It includes the description of the structure of the data warehouse. The
description is defined by schema, view, hierarchies, derived data definitions, and data mart locations and
contents
Business metadata − It contains the data ownership information, business definition, and changing
policies
Operational Metadata − It includes currency of data and data lineage. Currency of data means whether the
data is active, archived, or purged. Lineage of data means the history of data migrated and transformation
applied on it
Data for mapping from operational environment to data warehouse − It includes the source databases and
their contents, data extraction, data partition, cleaning, transformation rules, data refresh and purging rules
Algorithms for summarization − It includes dimension algorithms, data on granularity, aggregation,
summarizing, etc
Challenges for Metadata Management
• The importance of metadata cannot be overstated. Metadata helps in driving the accuracy of
reports, validates data transformation, and ensures the accuracy of calculations. Metadata also
enforces the definition of business terms to business end-users. With all these uses of
metadata, it also has its challenges. Some of the challenges are discussed below
• Metadata in a big organization is scattered across the organization. This metadata is spread in
spreadsheets, databases, and applications
• Metadata could be present in text files or multimedia files. To use this data for information
management solutions, it has to be correctly defined
• There are no industry-wide accepted standards. Data management solution vendors have a
narrow focus
• There are no easy and accepted methods of passing metadata
OLAP Hierarchical Structure / Types of OLAP
ROLAP
ROLAP works with data that exist in a relational database
Facts and dimension tables are stored as relational tables
It also allows multidimensional analysis of data and is the fastest growing OLAP
Advantages of ROLAP model:
High data efficiency. It offers high data efficiency because query performance and access language
are optimized particularly for the multidimensional data analysis
Scalability. This type of OLAP system offers scalability for managing large volumes of data, and
even when the data is steadily increasing
Drawbacks of ROLAP model:

• Demand for higher resources: ROLAP needs high utilization of manpower, software,
and hardware resources

• Aggregate data limitations. ROLAP tools use SQL for all calculation of aggregate data

• However, there are no set limits to the for handling computations

• Slow query performance. Query performance in this model is slow when compared
with MOLAP
Hybrid OLAP

• Hybrid OLAP is a mixture of both ROLAP and MOLAP

• It offers fast computation of MOLAP and higher scalability of ROLAP. HOLAP uses
two databases

• Aggregated or computed data is stored in a multidimensional OLAP cube

• Detailed information is stored in a relational database
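The two-store idea behind HOLAP can be sketched as follows. This is a simplified illustration: a Python dictionary stands in for the multidimensional cube, SQLite stands in for the relational database, and the table, function names, and data are assumptions.

```python
import sqlite3

# Detailed rows live in a relational store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (branch TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("A", 1997, 100.0), ("A", 1997, 50.0), ("B", 1998, 70.0)],
)

# Aggregates are precomputed once into an in-memory cube.
summary_cube = {
    (branch, year): total
    for branch, year, total in conn.execute(
        "SELECT branch, year, SUM(amount) FROM sales GROUP BY branch, year"
    )
}


def total_sales(branch, year):
    """Summary question: answered from the cube, without running SQL."""
    return summary_cube.get((branch, year), 0.0)


def sale_details(branch, year):
    """Detail question: answered from the relational store."""
    return conn.execute(
        "SELECT amount FROM sales WHERE branch = ? AND year = ? ORDER BY amount",
        (branch, year),
    ).fetchall()


print(total_sales("A", 1997))
print(sale_details("A", 1997))
```

Summary queries get the fast lookup associated with MOLAP, while the detailed rows stay in the relational store, which is where ROLAP's scalability comes from.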


ROLAP vs. MOLAP:
The following arguments can be given in favour of MOLAP:

• Relational tables are unnatural for multidimensional data

• Multidimensional arrays provide efficiency in storage and operations

• There is a mismatch between multidimensional operations and SQL

• For ROLAP to achieve efficiency, it has to perform outside current relational
systems, which is the same as what MOLAP does
